Preface



The papers in this book present various viewpoints on the design and im- 
plementation of techniques for QoS engineering for Internet services. They were 
selected from more than 70 submissions to the 1st International Workshop on
“Quality of future Internet Services” (QofIS) organized by COST Action 263.
The main focus of the papers is on the creation, configuration and deployment 
of end-to-end services over a QoS assured Internet using the IntServ (Integrated 
Services) and DiffServ (Differentiated Services) models. The main technical pro- 
gramme was completed by two keynote talks: IETF Chair Fred Baker opened 
the workshop with a discussion on major Internet development directions and 
Andrew M. Odlyzko of AT&T Labs Research gave the closing talk on Internet 
charging issues. The presentation of papers was organised in 9 sessions. 

The emphasis of Session 1 is on an assessment of the essential building blocks 
for a QoS assured Internet, i.e., queueing and scheduling, which basically defines 
the space for end-to-end services. The papers of this session discuss the bounds 
we may expect from these building blocks, the issues of queueing and scheduling 
management, and the parameters we need to tune in a dynamic implementation. 

Flow control and congestion control cannot be considered without regard to 
the dominating impact of TCP. The keyword of Session 2 is, therefore, Internet- 
friendly adaptation. Four papers in this session are complementary and together 
present an emerging understanding of a basic optimal area for such adaptation. 

Session 3 - End-to-End - highlights an interesting opposition within the 
IntServ domain between two proposed generalizations of RSVP. A third paper 
provides a pragmatic discussion of the issue of realizing dynamic service level 
agreements in a DiffServ domain. 

Session 4 addresses objectives, cost and particular mechanisms for QoS rout- 
ing and traffic engineering in the Internet. These aspects are shown to be im- 
portant components in a global approach to the realization of QoS guarantees. 

The importance of QoS measurements and measurement-based QoS mech- 
anisms is fairly well understood. Three papers in Session 5 present some inter- 
esting developments in the fields of measurement-based analysis, measurement 
metrics and Internet QoS measurement methodology. 

The papers of Session 6 analyse different issues of fairness, already addressed 
in Session 1, and discuss such aspects as: fairness as a new research area, fair 
bandwidth allocation via marking-based flow layering, and fairness metrics. 

Adaptation is the focal point of Session 7 - from hybrid error correction, 
through dual (per-hop and end-to-end) optimal control, to active networks. Not 
surprisingly, two adaptation papers in this session deal with multicast which is 
another hot Internet topic. 

One of the basic questions - How to charge for quality classes? - is examined 
in the papers of Session 8. The issue is addressed from the viewpoints of quality 
classes provisioning, inter-domain pricing interworking, and provider revenue, 
especially for mobile communication. 







Finally, the very traditional questions of resource utilisation and performance 
are revisited in Session 9 with emphasis on DiffServ networks. The papers of this 
session present novel approaches for live video scheduling, for inexpensive exper- 
imental QoS performance evaluation, and for lightweight resource management. 

The main track of the QofIS 2000 technical programme was accompanied 
by a mini-track on Internet Charging, represented in these proceedings by the 
abstracts of keynote talks and a panel. 

We wish to record our appreciation of the efforts of many people in bringing 
about the QofIS 2000 workshop: to the authors (we regret that it was not possible 
to accept more papers); to the Programme Committee and to all associated 
reviewers; to our sponsors, who made our job a bit easier; to the local Organising 
Committee, and particularly the Global Networking competence centre of GMD 
FOKUS; to all members of COST 263 Action who organised the first of what 
we hope will become a series of successful QofIS workshops. 



July 2000 Jon Crowcroft, Jim Roberts 

Michael Smirnov 
Abstract. A large number of products implementing aggregate buffering
and scheduling mechanisms have been developed and deployed, and
still more are under development. With the rapid increase in the demand 
for reliable end-to-end QoS solutions, it becomes increasingly important 
to understand the implications of aggregate scheduling on the resulting 
QoS capabilities. This paper studies the bounds on the worst case delay 
in a network implementing aggregate scheduling. We derive an upper 
bound on the queuing delay as a function of priority traffic utilization 
and the maximum hop count of any flow, and the shaping parameters 
at the network ingress. Our bound explodes at a certain utilization level 
which is a function of the hop count. We show that for a general network 
configuration and larger utilization, an upper bound on delay,
if it exists, must be a function of the number of nodes and/or the number 
of flows in the network. 

Keywords: delay, jitter, aggregate scheduling, FIFO, priority queuing,
Guaranteed Service, Diffserv



1 Introduction and Background 

In the last decade, the problem of providing end-to-end QoS guarantees in In- 
tegrated Services networks has received a lot of attention. One of the major 
challenges in providing hard end-to-end QoS guarantees is the problem of scalability.
Traditional approaches to providing hard end-to-end QoS guarantees,
which involve per flow signaling, buffering and scheduling, are difficult, if not 
impossible, to deploy in large ISP networks. As a result, methods based on traf- 
fic aggregation have recently become widespread. A notable example of methods 
based on aggregation has been developed in the Differentiated Services Working 
group of IETF. In particular, RFC 2598 [1] defines Expedited Forwarding per 
Hop Behavior (EF PHB), the underlying principle of which is to ensure that at 
each hop the aggregate of traffic requiring EF PHB treatment receives a service
rate exceeding the total bandwidth requirements of all flows in the aggregate at
this hop. Recently, a lot of practical implementations of EF PHB have emerged 
where all EF traffic is shaped and policed at the backbone ingress, while in the 
core equipment all EF traffic shares a single priority FIFO or single high-weight 
queue in a Class-Based Fair Queuing scheduler. Since these implementations of- 
fer a very high degree of scalability at comparatively low price, they are naturally 
very attractive. 

Sufficient bandwidth must be available on any link in the network to accom- 
modate bandwidth demands of all individual flows requiring end-to-end QoS. 
One way of ensuring such bandwidth availability is by generously over provi- 
sioning the network capacity. Such over provisioning requires adequate methods 
for measuring and predicting traffic demands. While a lot of work is being done 
in this area, there remains a substantial concern that the absence of explicit 
bandwidth reservations may undermine the ability of the network to provide 
hard QoS guarantees. As a result, approaches based on explicit aggregate band- 
width reservations using RSVP aggregation are being proposed [2]. The use of 
RSVP aggregation allows the uncertainty in bandwidth availability to be over- 
come in a scalable manner. As a result, methods based on RSVP aggregation 
can provide hard end-to-end bandwidth guarantees to individual flows sharing 
the traffic aggregate with a particular aggregate RSVP reservation. 

However, regardless of whether bandwidth availability is ensured by over 
provisioning, or by explicit bandwidth reservations, there remains a problem of 
whether it is possible to provide not only bandwidth guarantees but also end- 
to-end latency and jitter guarantees for traffic requiring such guarantees, in the 
case when only aggregate scheduling is implemented in the core. 

An important example of traffic requiring very stringent delay guarantees 
is voice. A popular approach to supporting voice traffic is to use EF PHB and 
serve it at a high priority. The underlying intuition is that as long as the load 
of voice traffic is substantially smaller than the service rate of the voice queue, 
there will be no (or very little) queueing delay and hence very little latency and 
jitter. However, little work has been done in quantifying exactly how much voice 
traffic can actually be supported without violating the stringent end-to-end delay
budget that voice traffic requires.

Another example when a traffic aggregate may require a stringent latency and 
jitter guarantee is the so-called Virtual Leased Line (VLL) [1]. The underlying 
idea of the VLL is to provide a customer with an emulation of a dedicated line of 
a given bandwidth over the IP infrastructure. One attractive property of the real 
dedicated line is that a customer who has traffic with diverse QoS requirements 
can perform local scheduling based on those requirements and local bandwidth 
allocation policies, and the end-to-end QoS experienced by his traffic will be 
entirely under the customer's control. One way to provide a close emulation of
this service over IP infrastructure could be to allocate a queue per each VLL in 
a WFQ scheduler at each router in the network. However, since there could be a 
very large number of VLLs in the core, per-VLL queuing and scheduling presents
a substantial challenge. As a result, it is very attractive to use aggregate queuing 
and scheduling for VLL traffic as well. However, aggregate buffering and scheduling
introduces substantially more jitter than per-VLL scheduling. This jitter results
in substantial deviation from the hard pipe model of the dedicated link, and may 
cause substantial degradation of service as seen by delay-sensitive traffic inside a 
given VLL. Hence, understanding just how severe the degradation of service (as 
seen by delay sensitive traffic within a VLL) can be with aggregate scheduling 
is an important problem that has not been well understood. 

One of the goals of this document is to quantify the service guarantees that 
can be achieved with aggregate scheduling. There are two approaches that can 
be taken in pursuit of this goal. One is to analyze statistical behavior of the 
system and to quantify “statistical guarantees”. While a large amount of work
has been done in this area, there is still uncertainty about what the appropriate
traffic model should be. In this respect, an interesting approach is presented
by Brichet et al. in [3], where it is conjectured that the statistical behavior of
periodic traffic entering a network of deterministic servers is “better” than the
behavior of Poisson traffic. If this conjecture is true, then the implication would
be that the queuing delay and resulting jitter of strictly periodic traffic would
be no worse than that of a tandem of M/D/1 servers. Another approach
to quantifying delay guarantees is to analyze the deterministic worst case performance.
Such analysis is of immediate importance in the context of providing
end-to-end Guaranteed Service [4] across a network with aggregate scheduling,
such as a Diffserv cloud implementing EF PHB. Since the GS definition requires
the availability of meaningful mathematical delay bounds, the ability to provide
a deterministic delay bound across a Diffserv cloud would enable supporting 
end-to-end Guaranteed Service to a flow even if its path traverses a Diffserv 
domain. 

Understanding worst case delay bounds in a Diffserv cloud is also important 
in the context of supporting VoIP applications over a Diffserv network. Encoded 
voice traffic is periodic by nature, and voice generated by devices such as VoIP 
PBXs using hardware coders synchronized to network SONET clocks can addi- 
tionally be highly synchronized. The consequence of periodicity is that if a bad 
pattern occurs for one packet of a flow, it is likely to repeat itself periodically 
for packets of the same flow. Therefore, a bad pattern does not merely affect a
random isolated packet, in which case only a very small number of packets of any
given flow would be affected and users would be unlikely ever to notice such a
rare event. Rather, it is a random flow or set of flows that may see consistently
bad performance for the duration of the call. As the amount of VoIP traffic increases,
every now and then a bad case will happen, and even if it does not happen that
frequently, when it does occur, a set of users is likely to suffer
potentially severe performance degradation. These considerations provide motivation
to take a closer look at the worst case behavior. On the other hand,
the temporary nature of voice traffic (e.g. conversations starting and ending) 
disrupts strict periodicity of voice traffic. Furthermore, in practice it should be 
expected that aperiodic traffic may be sharing the same aggregate queues with 
periodic traffic (e.g. a mission-critical medical application sharing the same EF
queue with voice flows). Hence, basing worst case results on the assumption of 
strict periodicity may yield erroneous results. 

There is a widespread belief, based largely on intuition, that as long as the uti- 
lization on any link is kept small enough (such as less than 50%), the worst case 
delay through the network will be very small. Unfortunately, this intuition leads 
to erroneous conclusions. Prior analytical work on studying the delay bounds 
in a network with aggregate scheduling indicates that networks with aggregate 
scheduling may have unbounded queuing delay even with aggregate output shap- 
ing [10]. On the other side of the spectrum, several researchers demonstrate that 
the ability to provide good delay bounds may depend on complex global conditions
[11], [12]. In particular, this work assumes that individual flows are shaped
at the network entry in such a way that the spacing between packets is at least
equal to the so-called route interference number (RIN)¹. It is shown that in
this case the end-to-end delay is bounded by the time to transmit a number of
packets equal to the RIN.

The contribution of this paper is twofold. First we show that for sufficiently 
low utilization factors, deterministic delay bounds can be obtained as a function 
of the bound on utilization of any link and the maximum hop count of any 
flow. Second, we argue that for larger utilization, for any given values of delay, 
hop count and utilization it is possible to construct a network in which the 
queuing delay exceeds the chosen delay value. This implies that for large enough 
utilization, either the bound does not exist at all, or if it does exist it must be 
a function of the size and possibly topology of the network. 

2 Model, Assumptions and Terminology 

We assume that all nodes in the network are output-buffered devices implementing
aggregate class-based strict Priority Scheduling². Traffic enters the network
at some nodes (referred to as ingress edge) and exits at some other nodes (re- 
ferred to as egress edge). We consider at least 2 classes of edge-to-edge (e2e) flows, 
where a flow is some collection of packets with the common ingress and egress 
edge pair. We will refer to one of the classes as priority class. At each node in the 
network packets belonging to the priority class are queued in a separate priority 
queue which is served at strict non-preemptive priority over any other queue. 
The total traffic of all priority flows sharing a particular link is referred to as a
priority aggregate. We assume that each e2e priority flow $F$ is shaped to conform
to a leaky bucket with parameters $(R_F, B_F)$ when it arrives at the ingress edge.
Note that the flow can itself consist of a number of microflows sharing the same 
ingress-egress edge pair, but no assumption is made on how those microflows 
are shaped. Our model further allows that the aggregate of all flows entering at
a given ingress point, regardless of their egress edge, can be additionally aggregately
shaped, but our results do not depend on such aggregate shaping.

¹ RIN is defined as the number of occurrences of a flow joining the path of some other
flow.

² Our results easily extend to class-based WFQ schedulers as well.

Let $S(l)$ denote the set of all priority flows constituting the priority aggregate
on link $l$. We assume that the amount of priority traffic on any link does not
exceed a certain ratio $0 < \alpha < 1$ of the capacity of any link. More specifically,
we require that for any link $l$ in the network $\sum_{F \in S(l)} R_F \le \alpha C(l)$, where $C(l)$ is
the capacity of link $l$. We also require that we can bound the sum of the token
bucket depths at any link relative to the link capacity, namely that we can
express some parameter $\tau$ such that for all links $l$, $\sum_{F \in S(l)} B_F \le \tau C(l)$. In many
cases, the depth of the leaky bucket of a flow depends linearly on the rate of the
flow, such that $B_F \le \tau_0 R_F$ for every flow $F$ and for some $\tau_0$. In such cases we set
$\tau = \alpha \tau_0$. This paper makes no assumptions about the exact mechanism ensuring
the conditions above - it can be based on explicit reservations such as described
in [2], or some reliable provisioning technique. In addition, we assume that the
route of any flow in the network traverses at most $h$ nodes (also referred to as
hops). Finally, we assume that there exists a bound on the size of any packet in
the network, denoted MTU.
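
As an illustration of the two per-link conditions above, the following minimal Python sketch computes the smallest $\alpha$ and $\tau$ that a given set of flows satisfies on one link. The `Flow` type, the function name `link_parameters` and the example numbers (100 identical flows on a 10 Mb/s link) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    rate_bps: float      # R_F: leaky bucket rate, bits per second
    bucket_bits: float   # B_F: leaky bucket depth, bits


def link_parameters(flows, capacity_bps):
    """Return (alpha, tau) for one link: the smallest values such that
    sum(R_F) <= alpha * C(l) and sum(B_F) <= tau * C(l)."""
    alpha = sum(f.rate_bps for f in flows) / capacity_bps
    tau = sum(f.bucket_bits for f in flows) / capacity_bps
    return alpha, tau


# Hypothetical example: 100 identical flows (32 kb/s, 100-byte bucket)
# sharing a 10 Mb/s link.
flows = [Flow(rate_bps=32_000, bucket_bits=800)] * 100
alpha, tau = link_parameters(flows, capacity_bps=10_000_000)
tau0 = 800 / 32_000  # B_F = tau0 * R_F holds for every flow in this example

print(f"alpha = {alpha:.3f}, tau = {tau * 1e3:.1f} ms, "
      f"alpha * tau0 = {alpha * tau0 * 1e3:.1f} ms")
```

In this example every flow has the same ratio $B_F / R_F = \tau_0$, so the computed $\tau$ coincides with $\alpha \tau_0$, as the linear-dependence case in the text suggests.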



3 A Delay Bound for a General Topology 



We can abstract our node description by saying that the priority aggregate at
link $l$ is guaranteed a service curve equal to $\sigma_l(t) = (tC(l) - MTU)^+$ [5, 6, 7, 8].
Also call $\Gamma(l)$ a bound on the peak rate of all incoming high priority traffic at
link $l$. If we have no information about this peak rate, then $\Gamma(l) = +\infty$. For
a router with large internal speed and buffering only at the output, $\Gamma(l)$ is the
sum of the bit rates of all incoming links. The delay bound is better for a smaller
$\Gamma(l)$. Also, define $u(l) = \frac{\Gamma(l) - C(l)}{\Gamma(l) - \alpha C(l)}$. Note that $u(l)$ increases with $\Gamma(l)$, and if
$\Gamma(l) = +\infty$, then $u(l) = 1$. Finally, let $u = \max_l u(l)$ and $\Delta = \max_l \frac{MTU}{C(l)}$.

Theorem 1. If $\alpha < \min_l \frac{\Gamma(l)}{(\Gamma(l) - C(l))(h-1) + C(l)}$ then a bound on the end-to-end
delay for high priority traffic is

$$D = \frac{h}{1 - (h-1)u\alpha} (\Delta + u\tau)$$


Proof The proof consists in showing separately that (1) if a finite bound exists, 
then the formula in Theorem 1 is true and (2) that a finite bound exists. 

(Part 1) We assume that a finite bound exists. Call $D'$ the worst case delay
across any node. Let $\alpha_F(t) = tR_F + B_F$ be the arrival curve for the fresh traffic
of flow $F$ at the network entry. Consider a buffer at some link $l$. An arrival curve
for the aggregate traffic at the input of link $l$ is

$$\alpha(t) = \min\left( t\Gamma(l), \sum_{F \in S(l)} \alpha_F(t + (h-1)D') \right)$$






Anna Charny and Jean-Yves Le Boudec 



The former part in the formula is because the total incoming bit rate is limited
by $\Gamma(l)$; the latter is because any flow reaching that node has undergone a delay
bounded by $(h-1)D'$. Thus

$$\alpha(t) \le \alpha'(t) = \min\left( t\Gamma(l), t\alpha C(l) + b' \right)$$

with $b' = \tau C(l) + \alpha C(l)(h-1)D'$. A bound on the delay at our node is given
by the horizontal deviation between the arrival curve $\alpha'(t)$ and the service curve
$\sigma_l(t) = (tC(l) - MTU)^+$ [5, 6, 7, 8]. After some algebra, this gives a bound on the
delay as $\frac{MTU + u(l) b'}{C(l)}$. Now $D'$ is the worst case delay; pick $l$ to be a node where
the worst case delay $D'$ is attained. We must have $D' \le \frac{MTU + u(l) b'}{C(l)}$. Thus, we
must have

$$D' \le \Delta + u\tau + (h-1)u\alpha D'$$

or equivalently

$$(1 - (h-1)u\alpha) D' \le \Delta + u\tau \qquad (1)$$

The condition on $\alpha$ means that $\alpha < \frac{1}{(h-1)u}$. Equation (1) implies then that

$$D' \le D_1 = \frac{\Delta + u\tau}{1 - (h-1)u\alpha}$$




The end-to-end delay is thus bounded by hDi, which after some algebra provides 
the required formula. 
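
The per-node step referred to as "some algebra" can be spelled out as follows. This is a sketch under assumptions that are reconstructed rather than quoted from the text: $\Gamma(l) \ge C(l)$, $u(l) = (\Gamma(l) - C(l)) / (\Gamma(l) - \alpha C(l))$ and $\Delta = \max_l MTU / C(l)$. Since the arrival curve grows faster than the service curve before the breakpoint $t_0$ and slower after it, the horizontal deviation $d(l)$ is attained at $t_0$.

```latex
% Assumed (reconstructed) definitions: \Gamma(l) \ge C(l),
% u(l) = (\Gamma(l) - C(l)) / (\Gamma(l) - \alpha C(l)), \Delta = \max_l MTU / C(l).
\begin{align*}
t_0 &= \frac{b'}{\Gamma(l) - \alpha C(l)}
  && \text{(the two branches of } \alpha'(t) \text{ intersect)} \\
\alpha'(t_0) &= \frac{\Gamma(l)\, b'}{\Gamma(l) - \alpha C(l)} \\
d(l) &= \frac{\alpha'(t_0) + MTU}{C(l)} - t_0
      = \frac{MTU}{C(l)}
        + \frac{\Gamma(l) - C(l)}{\Gamma(l) - \alpha C(l)} \cdot \frac{b'}{C(l)}
      = \frac{MTU + u(l)\, b'}{C(l)} \\
d(l) &\le \Delta + u \bigl( \tau + \alpha (h-1) D' \bigr)
      = \Delta + u\tau + (h-1)\, u \alpha D'
  && \text{(using } b' = \tau C(l) + \alpha C(l)(h-1)D' \text{)}
\end{align*}
```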

(Part 2) We now prove that a finite bound exists. We use the time-stopping
method in [13]. For any time $t > 0$, consider the virtual system made of the
original network, where all sources are stopped at time $t$. This network satisfies
the assumptions of part 1, since there is only a finite number of bits at the input.
Call $D'(t)$ the worst case delay across all nodes for the virtual network indexed
by $t$. From the above derivation we see that $D'(t) \le D_1$ for all $t$. Letting $t$ tend
to $+\infty$ shows that the worst case delay at any node remains bounded by $D_1$. □



Discussion: If $\Gamma(l) = +\infty$, namely if we have no information about the sum of
the bit rates of all incoming links, then the condition becomes $\alpha < \frac{1}{h-1}$ and the
bound is

$$D = \frac{h}{1 - (h-1)\alpha} (\Delta + \tau)$$

For finite values of $\Gamma(l)$, the bound is smaller. Table 1 illustrates the
value of our bound on one example. Our bound explodes when $\alpha$ tends to
$\min_l \frac{\Gamma(l)}{(\Gamma(l) - C(l))(h-1) + C(l)}$, which is very unfortunate. It is not clear whether the
delay does become unbounded in all cases where our bound is infinite. Explosion
of bounds for aggregate scheduling is not new, and has been encountered
for example in [9,10]. In [10], an example network is given where the worst case
delay becomes unbounded for a value of $\alpha$ close to 1. Note that if $\Gamma(l)$ is close to
$C(l)$ for all $l$, then $u$ is close to 0. This is the case for a high speed add-drop ring
network, where the bit rate is much higher for the ring than for the input links.
In that case, $D = h(\Delta + u\tau) + o(u)$ and the explosion does not take place. It is
also known that, in practice, if some restrictions on the topology, routes or rates
are imposed, better worst case delays for FIFO networks can be obtained. For
example, it can be shown that for some topologies, such as multistage networks,
the worst case delay is given by […], where $b_{\max}$ and $R_{\min}$ are
the upper bound on the token bucket depth and the lower bound on the rate
of all ingress leaky bucket shapers [14]. In this case the explosion does not take
place. It is shown in [11,12] that if individual flows are shaped at network entry
in such a way that the spacing between packets is at least equal to the route
interference number (RIN) of the flow (see footnote 1), then the end-to-end
delay is bounded by the time to transmit a number of packets equal to the RIN.
This usually results in much smaller bounds than the bounds here, albeit at the
expense of more restrictions on the routes and the rates.

Table 1. The bound $D$ (in ms) in this paper for $h = 10$, $B_F = 100$ B for
all flows, $R_F = 32$ kb/s for all flows, $C(l) = 149.760$ Mb/s and $\Gamma(l) = +\infty$ or
$\Gamma(l) = 2C(l)$

$\alpha$                        0.04     0.08     0.12     0.16     0.20
$D$ ($\Gamma(l) = +\infty$)     16.88    74.29    +∞       +∞       +∞
$D$ ($\Gamma(l) = 2C(l)$)       12.75    32.67    71.50    186.00   +∞
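
The bound of Theorem 1 is easy to evaluate numerically. The Python sketch below is one way to do so, under several assumptions that are not stated in the text: all links are identical, $\tau = \alpha \tau_0$ with $\tau_0 = B_F / R_F$, $\Delta = MTU / C(l)$, $u(l) = (\Gamma(l) - C(l)) / (\Gamma(l) - \alpha C(l))$, and an MTU of 1500 bytes (the caption of Table 1 does not give the MTU). The function name `delay_bound_ms` is hypothetical. Only the $\Gamma(l) = +\infty$ row of Table 1 is exercised here.

```python
import math


def delay_bound_ms(alpha, h, tau0_s, mtu_bits, capacity_bps, gamma_bps=math.inf):
    """Evaluate D = h * (Delta + u * tau) / (1 - (h - 1) * u * alpha), in ms.

    Assumptions: identical links, tau = alpha * tau0, Delta = MTU / C,
    u = (Gamma - C) / (Gamma - alpha * C), and u = 1 when Gamma is infinite.
    Returns math.inf when the condition on alpha is not met.
    """
    if math.isinf(gamma_bps):
        u = 1.0
    else:
        u = (gamma_bps - capacity_bps) / (gamma_bps - alpha * capacity_bps)
    denominator = 1.0 - (h - 1) * u * alpha
    if denominator <= 0.0:
        return math.inf
    delta = mtu_bits / capacity_bps   # time to transmit one MTU at link speed
    tau = alpha * tau0_s              # aggregate burst bound relative to capacity
    return 1000.0 * h * (delta + u * tau) / denominator


C = 149.760e6               # link capacity from the caption of Table 1, bits/s
TAU0 = (100 * 8) / 32_000   # B_F / R_F = 100 bytes / 32 kb/s = 25 ms
MTU_BITS = 1500 * 8         # ASSUMED MTU of 1500 bytes

for alpha in (0.04, 0.08, 0.12, 0.16, 0.20):
    print(alpha, round(delay_bound_ms(alpha, 10, TAU0, MTU_BITS, C), 2))
```

With these assumptions the two finite entries of the $\Gamma(l) = +\infty$ row come out as 16.88 ms and 74.29 ms, and the bound is infinite from $\alpha = 0.12$ onward, since $1/(h-1) \approx 0.111$ for $h = 10$.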

4 Delay with Larger Utilization 

We will now argue that if the utilization factor $\alpha > \frac{1}{h-1}$, then the worst case delay
bound, if it exists, must depend on some network parameter describing the size
of the network or its topology. More precisely, it will be shown that for any value
of the delay $D$, and for any $\alpha > \frac{kh+1}{(h-1)(kh-1)}$, there exists a large enough network
such that the delay of some packet can exceed $D$, even if the maximum hop
count of any flow never exceeds $h$. Since $k$ can be chosen to be arbitrarily large,
this implies that for any $\alpha > \frac{1}{h-1}$, and for any value of delay $D$, there exists a
network with maximum hop count $h$ where the delay of some packet exceeds $D$.

To see how this network can be built, consider the following iterative construction,
shown in Figure 1. The underlying idea of the example is to construct
a hierarchical network, in which each flow traverses exactly two levels of the
hierarchy. At the first stage of the hierarchy we construct a network, where one
flow initially ideally spaced at its rate traverses $h-1$ hops. This chosen flow will
be referred to as “transit flow”. At each of these hops the flow shares the link
with some number of flows which traverse this hop, but exit the network at the
next hop. All of these flows come from separate input links, and are referred
to as “cross-traffic”. All these flows send exactly one packet and exit. The configured
rates of all flows are the same, and the number $N$ of flows sharing each
hop is such that the utilization constraint is satisfied. The first packet of the
flow that traverses all $h-1$ hops encounters one packet of each of these flows, so
that the total delay experienced by this packet over $h$ hops is equal to the interarrival
time between two consecutive packets. Since the second packet is not
delayed, there are two back-to-back packets at the exit from hop $h-1$. In the
second stage, we consider the same topology, where the cross-traffic consists of
the same number of flows as at stage 1, except these flows have been the transit
flows of the corresponding level 1 network, and hence they deliver 2 back-to-back
packets each. The transit flow of this level of the hierarchy starts ideally shaped
as before, but encounters twice the number of packets compared to level 1, and
hence accumulates 3 back-to-back packets over $h-1$ hops. Repeating this a
sufficiently large number of times, we can obtain a flow which accumulates any
given number of packets per burst we may choose. Note that in this example
the utilization factor of every link is bounded from above
by a number less than 1, and all flows traverse exactly $h$ hops. More precisely, the
transit flow at iteration $i$ traverses $h-1$ hops at this iteration, and one more hop
at iteration $i+1$, and the cross-traffic traverses $h-1$ hops at iteration $i-1$, and
one hop at iteration $i$.

More detail can be found in Figure 1. The basic building block of the construction
is a chain of $h-1$ nodes. The nodes are indexed $n(i,j)$, where $i$ is
the iteration count and $j$ is the count of the nodes in the chain. At any iteration
$i$, there is a source $S(i,0)$ which is attached directly to node $n(i,1)$. Each
of the nodes $n(i,j)$, $j = 1, 2, \ldots, h-2$, also has $kh$ input links, and all nodes
$n(i,j)$, $j = 2, \ldots, h-1$, have $kh$ output links. Each of the $kh$ input links of iteration
$i$ is connected to node $n(i-1, h-1)$ of a network constructed at iteration
$i-1$ (the networks constructed at iteration $i-1$ are denoted $S(i-1, j, m)$,
where $j$ is the index of the node in the chain, and $m$ is the index of the input
to node $n(i,j)$). Each of the $kh$ output links corresponding to each node $n(i,j)$
is connected to one of $kh$ destination nodes, denoted $D(i,j,m)$.

At iteration $i = 1$, each of the networks $S(0,j,m)$ (i.e. each of the black
circles) is just a single source node. Assume that all packets are the same size,
and that all the sources in the network are leaky bucket constrained with rate
$r = \frac{\alpha C}{kh+1}$ and burst (depth) equal to one packet. Assume that a flow originating
at source $S(0,j,m)$ enters at node $n(i,j)$ of the chain, traverses a single hop
of the chain and exits at node $n(i,j+1)$ to destination $D(0,j+1,m)$. Assume
further that a flow originating at source $S(1,0)$ traverses the whole chain. Note
that the utilization of any link in the chain does not exceed $\alpha$. Assume that
the propagation time of any link is negligible (it will become clear that this
assumption does not lead to any loss of generality). Assume also that all links
are of the same speed, and that the time to transmit the packet at link speed is
chosen as unit of time.

At iteration $i = 1$, at time $t = 0$ we make a single packet fully arrive at
node $n(1,1)$ from each of the sources $S(1,1,m)$. Let's color these packets blue.
Assume also that at time zero a red packet started its transmission from node
$S(1,0)$ and fully arrived at $n(1,1)$ at time $t_1 = 1$. By this time one blue packet
of the $hk$ blue packets which arrived at time $t = 0$ has departed from $n(1,1)$,
and therefore the red packet has found a queue of length $kh - 1$ blue packets
ahead of it. By time $t_2 = t_1 + kh$ the red packet will be fully transmitted from
node $n(1,1)$. Assume that at time $t_2 - 1$ a single blue packet fully arrives from
each of the nodes $S(1,2,m)$, so that the red packet finds a queue of size $kh - 1$
at the second node in the chain as well (recall that all the blue packets from the
previous node exit the network at node $n(1,2)$).

Fig. 1. The iterative network topology of iteration $i$ for $h = 5$. Black circles
represent networks constructed at iteration $i - 1$. Only 3 out of $kh$ black circles
are shown per each node $n(i,j)$. The output of node $n(i-1, h-1)$ in iteration $i - 1$,
denoted by the arrow, is connected to the input of one of the black circles
of iteration $i$.

Repeating this process at each of the nodes in the chain, it is easy to see
that the red packet will suffer a total queueing delay of $(h-1)(hk-1)$ packet
times at link speed as it traverses the chain. Since the rate of the red flow
is $r \ge \frac{C}{(h-1)(kh-1)}$, it means that in $(h-1)(hk-1)$ packet
times at link speed $C$ at least one more red packet can be transmitted from the
source $S(1,0)$. Since there are no other blue packets arriving to the chain by
construction³, the second red packet will catch up with the first red packet, and
hence a burst of two red packets will accumulate at the exit of node $n(1,h-1)$.
Assume that the red flow at the first iteration just sends two packets and exits.
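
The timing argument for the first iteration can be checked with a small simulation. The Python sketch below models each node as a FIFO queue with unit service time (one packet time at link speed) and zero propagation delay, and follows the red packet through the $h-1$ nodes of the chain. The function names are hypothetical, and the arrival schedule (all $kh$ blue packets fully arriving one packet time before the red packet at every node) is taken from the description above.

```python
def fifo_departure_times(arrival_times):
    """Unit-service-time FIFO queue that is empty at time 0.

    arrival_times must be sorted in non-decreasing order; the result lists
    the time at which each packet has been fully transmitted.
    """
    free_at = 0
    departures = []
    for t in arrival_times:
        start = max(t, free_at)   # wait for the link to become free
        free_at = start + 1       # one packet time to transmit
        departures.append(free_at)
    return departures


def red_packet_chain(h, k):
    """Follow the red packet through the h-1 nodes of the iteration-1 chain.

    Returns (exit_time, total_queueing_delay), both in packet times.
    """
    t_arrive = 1      # the red packet fully arrives at the first node at t_1 = 1
    queueing = 0
    for _ in range(h - 1):
        # kh blue packets fully arrive one packet time before the red packet
        arrivals = [t_arrive - 1] * (k * h) + [t_arrive]
        t_depart = fifo_departure_times(arrivals)[-1]
        queueing += t_depart - t_arrive - 1   # exclude the red packet's own transmission
        t_arrive = t_depart                   # negligible propagation time
    return t_arrive, queueing


exit_time, queueing_delay = red_packet_chain(h=5, k=2)
print(exit_time, queueing_delay)
```

For $h = 5$ and $k = 2$ the simulated queueing delay equals $(h-1)(hk-1) = 36$ packet times, and the red packet leaves the first node at $t_1 + kh = 11$.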

The next step of the iterative construction will take $(h-1)hk$ of the networks
constructed in step 1 and connect the output of node $n(1,h-1)$ of each of
these networks (node $n(1,4)$ in Fig. 1) to one of the nodes $n(2,j)$ of the second
iteration. That is, node $n(i-1,h-1)$ becomes the source $S(i-1,j,m)$ of the
chain at the next iteration. Let $T(1)$ be the time it takes to accumulate the two-packet
burst at the exit from node $n(1,h-1)$ at iteration 1. Let then $T(1)$ be also
the time when the first bit of the two-packet burst arrives at node $n(2,1)$ from
each of the nodes $S(1,1,m)$ at iteration 2. Let's recolor all those packets blue. If
the source $S(2,0)$ emits a red packet at time $t_1 = T(1) + 1$, then it fully arrives
at node $n(2,1)$ at time $t_2 = T(1) + 2$, and finds a queue of $2hk - 1$ blue
packets (since $2hk$ packets have arrived and one has left in the previous time
slot). That means that the last bit of the red packet will leave node $n(2,1)$ at
time $T(1) + 2 + 2kh$. Assume now that the level-1 networks connected to node
$n(2,2)$ are started at time $2kh$, instead of at time zero. This means that by the
time the last bit of our red packet in the second iteration arrives at node $n(2,2)$,
a two-packet (blue) burst will arrive at node $n(2,2)$ from each of the networks
$S(1,2,m)$, and so the red packet will again find a queue of $2kh - 1$ blue
packets. Repeating this process for all $h-1$ nodes in the chain at iteration 2,
we make the red packet delayed by $(2kh-1)(h-1) > 2(kh-1)(h-1)$ packet
times. During this time 2 more red packets will catch up, resulting in a 3-packet
burst. Assume that at iteration 2 the red flow sends 3 packets and exits.

³ Note that the fact that blue sources are leaky bucket constrained with rate $r$ does
not prevent the possibility that they are sending at a rate slower than $r$. In particular,
each blue source can send just a single packet and then stop without violating any
of the constraints.

Repeating this process as many times as necessary, with the red flow sending 
i+1 packets at iteration i and accumulating the burst of size i + 1, we can grow 
a burst of red packets at least by one at each new iteration. Therefore, for any 
value of the delay D, we can always build a network of enough iterations, so that 
the burst B accumulated by the red flow over h — 1 hops is such that B/C > D. 
Hence, taking one more iteration, it is easy to see that the red packet at the 
next iteration will be delayed by more than D at the first hop. 
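
The arithmetic of the construction can be checked with a short sketch. It assumes unit-size packets and a link rate of one packet per time slot, so that B/C is simply the burst size in packet times, and it assumes the burst grows by exactly one packet per iteration (the text only guarantees "at least one"). The function names are illustrative and do not come from the paper.

```python
import math


def red_delay_iteration2(h: int, k: int) -> int:
    """Delay (in packet times) of the red packet over the h - 1 hops of
    iteration 2, where it finds 2kh - 1 blue packets queued at every hop."""
    return (2 * k * h - 1) * (h - 1)


def iterations_needed(delay_bound: float) -> int:
    """Smallest iteration i whose burst of i + 1 packets satisfies
    B/C > delay_bound, assuming B = i + 1 packets and C = 1 packet/slot."""
    return max(1, math.floor(delay_bound))


if __name__ == "__main__":
    h, k = 5, 2  # h = 5 matches Fig. 1 (node n(1, h - 1) is n(1, 4)); k is arbitrary
    print(red_delay_iteration2(h, k), ">", 2 * (k * h - 1) * (h - 1))
    print(iterations_needed(10.0))
```

Under these assumptions the number of iterations grows only linearly with the target delay D, while the number of level-1 networks needed grows much faster, since each iteration multiplies the network count by (h - 1)hk.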

We emphasize that at each iteration the red flow traverses h - 1 hops, and
then one more hop as it is recolored blue in the next iteration. Hence, no flow
in the network ever traverses more than h hops.

An interesting implication of this example is that the delay of a packet of a
flow may be affected by flows which do not share any links with the
given flow and, what is even more surprising, by flows that had exited the network
long before the flow to which the packet belongs entered it.

5 Discussion 

The main difficulty in obtaining hard delay bounds for an arbitrary topology 
arises from the fact that in the case of aggregate scheduling, delay experienced 
by a single packet depends not only on the behavior of the flows that share 
at least one queue with the packet in question, but also on the behavior and 
arrival pattern of flows in the entire network, potentially much before the flow 
to which our packet belongs enters the network. As shown in the example in 
the previous section, a packet can encounter accumulated bursts of traffic that 
was initially ideally smooth at the entry, which in turn causes further burst 
accumulation of the considered flow. In turn, the new burst may contribute to 
further cascading jitter accumulation of other flows whose packets it encounters 
downstream. Intuitively, the lower the utilization factor and the shorter the route
of the flow, the fewer chances a packet of that flow has of encountering other
bursts, accumulating its own, and adversely affecting other traffic downstream.
Our work has provided some quantification of this intuition.

The ability to provide hard delay bounds in a general topology network with
aggregate scheduling must rely on cooperation of all devices in the cloud, as well
as strict constraints on the traffic entering the cloud. It has been demonstrated
that control of the following network-wide parameters is
essential for the ability to provide strict queuing delay guarantees in such a
network:

1. limited number of hops of any flow across the network 

2. low (bounded) ratio of the load of priority traffic to the service rate of the 
priority queue on any link in the network 

3. strict constraints on the quality and granularity of ingress shaping 

While this has been intuitively understood for a long time, our contribution 
is to quantify exactly how these parameters affect the delay bound. In particular, 
barring new results on delay bounds, our results imply that in the absence of 
any other constraints on the topology or routes, priority traffic utilization must 
be much lower than previously assumed if strong delay and jitter guarantees 
are required. For the typical case of leaky bucket constrained flows, our delay
bound is linear in the bound on the ratio Bp/Rp across all priority flows that are
individually shaped at the ingress. An immediate implication is that the smaller
this ratio, the better the bound that can be provided. While Bp depends
primarily on the accuracy of the shaper, Rp depends on the granularity of shaped 
flows. In the case of microflow shaping, the minimal rate of the microflow can be 
quite small, resulting in a large delay bound. There is a substantial advantage 
therefore in aggregating many small microflows into an aggregate and shaping 
the aggregate as a whole. While in principle there is a range of choices for
aggregation, there are two natural choices: edge-to-edge aggregates or
edge-to-everywhere aggregates. Edge-to-edge shaping is natural for explicit bandwidth
reservation techniques. In this case Rp and Bp relate to the rate and token
bucket depth of the edge-to-edge flow aggregates. As discussed above, shaping
the total edge-to-edge aggregates allows reducing the delay bound linearly with
the degree of aggregation. Edge-to-everywhere aggregate shaping is frequently
assumed in conjunction with bandwidth provisioning.

The effect of this choice on delay bounds depends on exactly how provisioning
is done. One possibility for provisioning the network is to estimate the
edge-to-edge demand matrix for priority traffic and ensure that there is sufficient
capacity to accommodate this demand, assuming that the traffic matrix is
accurate enough. Another option is to make no assumption on the edge-to-edge
priority traffic matrix, but rather to admit a certain amount of priority traffic
at each ingress edge, regardless of the destination, and provision the network
in such a way that even if all traffic from all sources happens to pass through a
single bottleneck link, the capacity of that link is sufficient to ensure that priority
traffic utilization does not exceed α. Depending on which of the two
provisioning options is chosen, shaping of the edge-to-everywhere aggregate has the
opposite effect on the delay bound. In the case of "edge-to-edge provisioning",
the bandwidth of any link may be sufficient to accommodate the actual load of
priority traffic while remaining within the target utilization bound. Hence, it is
the minimal rate and the maximum burst size of the actual edge-to-edge aggregates
sharing any link that affect the delay bound. However, aggregate edge-to-everywhere
shaping may result in an individual substream of the shaped edge-to-everywhere
aggregate being shaped to a much higher rate than the expected rate of that
substream. When the edge-to-everywhere aggregate splits inside the network into
different substreams going to different destinations, each of those substreams
may have in the worst case a substantially larger token bucket depth than that of
the aggregate edge-to-everywhere stream^. This results in a substantial increase of
the worst case delay over the edge-to-edge shaping model. Moreover, in this case
the properties of ingress shapers do not provide sufficient information to bound
the worst case delay, since it is the burst tolerance of the substreams inside the
shaped aggregates that is needed, but is unknown.
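
The effect on a substream can be illustrated with a small calculation. The sketch below is not part of the analysis above; it assumes unit-size packets and the usual token bucket constraint (the number of packets in any interval of length t is at most depth + rate * t), and the function name and the example numbers are hypothetical. It computes the smallest bucket depth with which an arrival pattern conforms at a given rate.

```python
from fractions import Fraction


def min_bucket_depth(times, rate):
    """Smallest token bucket depth (in packets) for which unit-size packets
    arriving at the sorted instants `times` conform to the given rate."""
    depth = 0
    for a in range(len(times)):
        for b in range(a, len(times)):
            # packets a..b arrive within times[b] - times[a] time units
            excess = (b - a + 1) - rate * (times[b] - times[a])
            depth = max(depth, excess)
    return depth


if __name__ == "__main__":
    # Worst case: the whole output of an aggregate shaped to rate 1 and depth 4
    # (a 4-packet burst, then one packet per tick) goes to a single destination.
    substream = [0] * 4 + list(range(1, 31))
    print(min_bucket_depth(substream, 1))                # depth seen at the aggregate rate
    print(min_bucket_depth(substream, Fraction(1, 10)))  # depth at the substream's expected rate
```

In this example the substream conforms to the aggregate's bucket of depth 4, but characterising it at an expected rate of one tenth of the aggregate rate requires a depth of 31 packets, which is the kind of inflation the paragraph above describes.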

In contrast, if "worst case" provisioning is assumed, the network is provisioned
in such a way that each link can accommodate all the traffic even if all
edge-to-everywhere aggregates end up sharing this link. In this case the shaping
parameters of the edge-to-everywhere aggregate are appropriate for the delay
bound. Intuitively, in this case the aggregate burst tolerance on any link can
only be smaller than in the worst case.

6 Summary 

We have discussed the delay bounds which can be obtained in a network with 
aggregate priority FIFO scheduling for priority traffic. Our main result is the 
delay bound for an arbitrary topology which is obtained as a function of priority 
traffic utilization and the maximum hop count. Unfortunately, this bound is valid 
only for reasonably small utilization. We have argued that for larger utilization 
the bound either does not exist, or must depend on some other constraints 
on the topology or routes. Our results imply that at least for small utilization 
and hop count, it is possible to provide strict delay guarantees even if only 
aggregate scheduling is employed in the network. We have argued that the delay 
bound can be improved by using appropriate aggregate shaping policies at the 
network ingress. It remains unclear whether the use of some additional network 
parameters characterizing the size of the network might yield reasonable delay 
bounds at larger utilization for a general topology. Understanding this issue 
appears to be an important area of future work both from the analytical and 
practical standpoints. 

^ This is because a low-rate substream of a high-rate shaped aggregate may be shaped 
to a much larger rate than its own. 

® Note that the "worst case" provisioning model targeting a particular utilization
bound results in substantially more overprovisioning than the "point-to-point"
provisioning using an estimated traffic matrix, or explicit point-to-point bandwidth
allocation using signaled admission control.

Abstract. Packet scheduling and queue management strategies are key
issues of DiffServ per-hop behaviours. This paper proposes a queue 
management system that, in conjunction with scheduling mechanisms, 
is able to support class differentiation. The general principles and the 
architecture of the queue management system are presented. The 
proposal is supported by a prototype that was subject to several tests, in 
terms of packet drops and burst tolerance. The test results are presented 
and analysed, allowing an assessment of the usefulness and 
effectiveness of the underlying ideas. 



1 Introduction and Framework 

One of the most challenging demands for the new generation of network elements
able to provide quality of service (QoS) is to provide better ways to manage packet
queue lengths, as it is well known that some form of "active queue management" is
needed to obtain better performance levels of the communication system: for
instance, lower transit delay, lower packet loss and better use of the available
bandwidth. Important research is being conducted by many teams studying and
discussing models and approaches for supporting such systems [2], [7], [8].

This paper contributes to that discussion by presenting a strategy for queue 
management (QM) specifically in QoS-capable IP networks following the 
differentiated services (DS) framework^1 [3]. The work presented here was conducted
to fulfil the requirements of a broader project, whose main goal is to develop a new 
approach for the support of traffic classes in IP networks, while following the same 
framework. 

This broader, ongoing project has three main goals: (1) to develop mechanisms to
provide effective QoS capabilities in Network Elements (NE); (2) to conceive ways to
select adequate, QoS-aware paths for packet forwarding along communication
systems; (3) to implement effective ways for system management, including a
strategy for traffic admission control.

1 This framework is being promoted by the Differentiated Services Working Group of the
IETF [5].

J. Crowcroft, J. Roberts, and M. Smirnov (Eds.): QofIS 2000, LNCS 1922, pp. 14-27, 2000.

This paper results from the work done in relation to goal (1), which led to the
proposal of a new PHB (per-hop behaviour) and to the construction of a router 
prototype that implements it [13]. Two main contributions of this prototype are a new 
IP packet scheduling strategy, presented in [14], and a new IP queue management 
strategy, presented here. 

In Section 2 the general principles which rule the design of the proposed QM 
system are presented. Section 3 discusses its architecture, focusing on a prototype 
developed to test the underlying ideas. Section 4 presents the tests carried out on
the developed prototype and discusses the corresponding results.



2 General Operational Principles of a DS Queue Management System

Two important drawbacks characterise the traditional approach in use in the Internet 
for queue management, known as the tail drop (TD) approach [2]. The first drawback 
is called the lockout phenomenon, which happens when a single flow or a few flows
monopolise the queue space, preventing other flows from getting the space that 
normally they would use. The second drawback is known as the full queue problem 
and occurs because the TD approach allows full, or almost full, queues, during long 
periods of time. This increases the delay seen by packets and reduces the NE 
tolerance to packet bursts. In turn, lower tolerance to packet bursts results in a higher 
percentage of dropped packets and lower link utilisation levels, because packets are 
dropped in sequence and flows that are responsive to congestion (like, for instance, 
TCP flows) back-off synchronously. As a rule of thumb, a QM system should always 
provide room for a packet arriving at a network element. 

The tail drop approach is, therefore, inadequate for the modern Internet. In
addition, and considering the DS framework, a network element should have an 
effective way to share its resources among different traffic classes, according to their 
QoS needs. As buffer space is an important resource, QM disciplines have an 
important responsibility in terms of resource sharing, side by side with packet 
scheduling disciplines. 

When executing drops, queue management systems should also consider the nature 
of the involved traffic. They should avoid dropping consecutive UDP packets 
belonging to the same flow, because the impact of loss will be dependent on the 
distance between dropped packets [9] for most applications that use this transport 
protocol. Consecutive dropping should also be avoided for TCP packets, in order to 
minimize the probability of eliminating packets belonging to the same TCP flow. 

Still related to the diverse nature of traffic, QM systems should deal with the
growing volume of flows that are unresponsive to congestion (for the purpose of this
discussion, UDP flows are considered to be unresponsive to congestion; nevertheless,
UDP flows can also be responsive to congestion, depending on the upper layer
protocols that are using this transport protocol). For this, they should use mechanisms
to protect responsive flows from unresponsive flows: currently, the resources freed
by the former are immediately, and unfairly, used by the latter when congestion
happens. Additionally, QM systems should implement effective means to manage
unresponsive flows, to avoid congestion.

The queue management system developed at LCT-CISUC was motivated by the 
previous considerations. In broad terms, its design addressed two closely related
fields. In the first one, the idea was to design enqueuing and dropping processes in a
way that avoids lockout, promotes burst tolerance and minimises the impact of packet
loss on applications. In the second field, the idea was to conceive an effective way to
manage queue lengths in order to control the drop level seen by the flows of each
class^2. Accordingly, the prototype constructed to test the proposed ideas was
implemented in two phases, which resulted in two different modules: the packet drop 
management module and the queue length management module. These modules are 
presented in the next section. 



3 Architecture of the Queue Management System 

Figure 2 of [13], in the present QofIS 2000 Proceedings, depicts the QoS-capable
router prototype implemented at LCT-CISUC, highlighting the central object of the
queue management system operation: the IP output queues.

At IP level, after the IP processing activity, packets are classified and stored 
accordingly in an output queue. It is there that the mechanisms which differentiate the 
QoS provided to classes act. The classifier/marker is responsible for determining each
packet's class and/or for setting the information related to it in the packet's header,
following the strategy defined by the IETF-DSWG^3 [11].

The monitor continuously measures the QoS provided to classes, applying a QoS 
metric developed for this purpose. Its operation is discussed in more detail below.

The scheduler is responsible for choosing the next packet to process. The dropper 
is responsible for choosing the packets to drop and also, in close relation to the 
monitor, for adjusting some operational parameters such as queue lengths. 

The queue management system, seen as a whole, involves the process of storing
packets in IP output queues and the action of the dropper. It is presented in the next 
two subsections, through the discussion of its two main modules. 



3.1 The Packet Drop Management Module 

There are two important parameters related to the dropper operation, that characterise 
each output queue: q_physical_limit, the maximum possible length for each 
queue (which is arbitrarily large); and q_virtual_limit, the desired maximum 
length for each queue. 



2 Each class has its own exclusive queue in an NE.

3 Internet Engineering Task Force - Differentiated Services Working Group

If the physical limit has been reached when a packet is to be enqueued, the packet is
immediately dropped. As this limit is made extremely large, this will be uncommon.
Thus, every packet or burst of packets will normally be enqueued. 

From time to time the dropper is activated. This time is adjustable and should 
reflect the length of packet bursts to be accommodated (the default value is 15 
packets). When the dropper is activated, it verifies whether or not there are more 
packets than the amount imposed by the virtual limit for each queue. If this is the 
case, the dropper discards the excess packets in order for that limit to be respected. 
TCP packets will only be dropped if they exceed a given percentage of the total 
number of packets; moreover, TCP and UDP packets are randomly dropped. 

The logical operation of the Packet Drop Management Module is presented in 
figure B. 1 of [13]. In short, the module always provides room for an incoming packet, 
packet bursts are naturally tolerated, the drop of UDP packets is scattered and it is 
possible to protect TCP traffic from UDP traffic. 
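
The behaviour described above can be sketched as follows. This is a minimal illustration, not the prototype's kernel code (the actual logic is given in [13]). Only q_physical_limit and q_virtual_limit come from the text; the class name, the tcp_share_threshold parameter, its default value and the choice to pick UDP victims before TCP ones are assumptions made for the example.

```python
import random
from collections import deque


class DropManagedQueue:
    """Hypothetical model of one output queue with a periodic dropper."""

    def __init__(self, q_virtual_limit, q_physical_limit=10_000,
                 tcp_share_threshold=0.5, rng=None):
        self.q_virtual_limit = q_virtual_limit
        self.q_physical_limit = q_physical_limit
        self.tcp_share_threshold = tcp_share_threshold  # assumed meaning of "given percentage"
        self.rng = rng or random.Random()
        self.packets = deque()  # items are (sequence_number, protocol)

    def enqueue(self, packet):
        """Accept the packet unless the (very large) physical limit is reached."""
        if len(self.packets) >= self.q_physical_limit:
            return False
        self.packets.append(packet)
        return True

    def run_dropper(self):
        """Called periodically; randomly discards packets above the virtual limit."""
        excess = len(self.packets) - self.q_virtual_limit
        if excess <= 0:
            return []
        pkts = list(self.packets)
        udp = [i for i, p in enumerate(pkts) if p[1] == "udp"]
        tcp = [i for i, p in enumerate(pkts) if p[1] == "tcp"]
        # Assumption: TCP is only eligible when its share exceeds the threshold.
        tcp_eligible = len(tcp) > self.tcp_share_threshold * len(pkts)
        victims = set(self.rng.sample(udp, min(excess, len(udp))))
        remaining = excess - len(victims)
        if remaining > 0 and tcp_eligible:
            victims |= set(self.rng.sample(tcp, min(remaining, len(tcp))))
        self.packets = deque(p for i, p in enumerate(pkts) if i not in victims)
        return [pkts[i] for i in sorted(victims)]
```

A caller would invoke run_dropper() at the adjustable activation interval, for example after every 15 arrivals to match the default mentioned above. In this sketch the queue may stay above the virtual limit when TCP packets are protected and too few UDP packets are present; whether the prototype behaves that way is not stated in the text.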



3.2 Queue Length Management Module 

The Queue Length Management Module is responsible for the dynamic allocation 
of buffers to classes. The allocation of buffer space to queues is performed in 
conjunction with the scheduler action (which distributes the transmitter capacity 
among classes). The former controls the level of packet drops, and the latter controls 
the delay seen by packets [13]. 

The strategy used to share buffer resources follows the one used by the scheduler
to share transmitter capacity [14]. The LCT-CISUC QoS metric [13][15] is central to
that strategy. According to this QoS metric, the quality of service is quantified
through a variable named congestion index (CI). There is a CI related to delay and a
CI related to loss. The queue length management module uses the latter.

Each class is characterised by a degradation slope (DSLOPE), which determines
the class's sensitivity to the degradation of the loss level. As can be seen in figure 1,
a traffic class highly sensitive to loss degradation will have a high
DSLOPE.




Fig. 1. CIs for three traffic classes with different sensitivities to loss degradation (d1, d2 and d3
represent the loss level experienced by each class when the congestion index has the value CIt)

It is easy to understand that, with growing loads (that is, as the loss CI grows), the
drop level seen by the most sensitive classes is lower than the drop level seen by less
sensitive classes. In other words, more sensitive classes are protected by less
sensitive classes, which absorb the major part of the degradation.
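
One way to read Fig. 1 is sketched below. It assumes that each class's congestion index is a straight line through the origin whose angle with the loss axis is DSLOPE, so that CI = tan(DSLOPE) * loss. This linear form is an assumption made for illustration only; the actual metric is defined in [13] and [15], and the function name is hypothetical.

```python
import math


def loss_level(ci: float, dslope_degrees: float) -> float:
    """Loss level at which a class with the given DSLOPE reaches congestion
    index `ci`, under the assumed relation CI = tan(DSLOPE) * loss."""
    return ci / math.tan(math.radians(dslope_degrees))


if __name__ == "__main__":
    ci_t = 40.0  # a common congestion index value, arbitrary units
    for slope in (20, 45, 70):
        print(slope, round(loss_level(ci_t, slope), 2))
```

Under this reading, equalising the CI across classes gives the class with the steeper slope the smaller loss level, which matches the protection effect described above.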

Figure 2 presents the logical operation of the Queue Length Management Module. 






Fig. 2. Queue length management module logical operation 



4 Tests 

The initial testbed used to carry out the tests consisted of a small isolated network
with 4 Intel Pentium PC machines configured with a Celeron 333 MHz CPU, 32 MB
RAM and Intel EtherExpress Pro/100B network cards. The prototype router (Router)
ran FreeBSD 2.2.6 [6], patched with ALTQ version 1.0.1 [4], with 64 MB RAM. Two
hosts, Source1 and Source2, generated traffic directed towards a destination host
(Sink) through Router. Each host only generated traffic of a given class, in order to
guarantee the independence of the generated packet streams. To perform additional
tests involving one more class, another host, Source3, was installed in a subsequent
phase.

The tests were performed with the aid of three basic tools: Nttcp [12], QoStat [1]
and a modified version of Mgen [10], Mgen_m.

Nttcp and QoStat were used to test the Queue Length Management Module. The
former was used to generate data flows at the maximum possible rate according to a
uniform load distribution, which competed for all the resources available in the
communication system. The latter was used to monitor the kernel, namely the
operation of the QM system and the QoS it provided to classes. QoStat was also used
to dynamically change the operational parameters related to queue management.

Mgen_m was used to test the Packet Drop Management Module. It was
constructed using some parts of the MGEN code, with performance and reliability
(with respect to packet logging) in mind. The MGEN tool was also deeply modified
with respect to packet generation, in order to allow the generation of IP Raw packet
streams and bursty packet streams according to a strategy followed in [8], as
described in 4.2 under General Test Conditions.



4.1 Test of the Queue Length Management Module

The first set of tests involved only two classes and high loads, with packets of
1400 bytes. The maximum number of packets in the queuing system
(MAX_PCKS_ONQ) was configured to 60.

Moreover, the two classes were configured with the same sensitivity to delay 
degradation: sensitivity slope equal to 45 degrees. On the other hand, the loss
degradation sensitivity of the traffic classes changed with time, starting with a
degradation slope of 45 degrees for both classes and, after each 25 s time interval, taking
the following values (class 1-class 2): 40-50; 30-60; 25-65; and finally 20-70.

The results of the tests are presented in figure 3. The capacity of the prototype to 
differentiate classes is obvious, namely in terms of the ability to provide different loss 
levels to classes. When they have the same sensitivity to loss degradation, both 
classes suffer nearly the same loss level. Changing their sensitivity to degradation 
results in a coherent change of the rate of dropped packets. When a class becomes 
more sensitive to loss degradation (see class 2 in figure 3) the rate at which its packets 
are dropped decreases. 

Moreover, the sharp transition shown in the graphs reveals that the system has a 
good capacity to react to changes. Thus, in this test, the prototype reveals a good 
capacity to effectively control the classes' loss level. 

Notice that, as the classes are configured to have the same sensitivity to delay
degradation, the average transit delay experienced by packets of both queues is
roughly the same in all the tests, as can be verified in figure 3. It is possible to see that
the delay increases as the classes' sensitivities diverge. This is natural given that the
asymmetry of the classes' sensitivities induces asymmetry of queue lengths. This
asymmetry changes the delay reference, which is obviously determined by the bigger
queue.

Table 1 shows some average values obtained with the QoStat tool during our experiments.

The transfer of buffer space from the class with less sensitivity to the other class is
quantified. The prototype reacts to changes in the classes' degradation slopes
(which induce asymmetry in the corresponding loss congestion indexes) by
transferring buffer space from one class to the other. When both
classes have the same sensitivity, the prototype gives nearly the same number of
buffers (29 and 31) to each class.

It is also possible to see that the prototype reacts to changes in the classes'
DSLOPE by giving the same loss congestion index to both classes. As expected, the
congestion index is lower when the difference between degradation slopes is bigger.



Table 1. Prototype operational values for the first set of tests

DSlope             Q_virtual_limit        CI
Class1/Class2      Class 1    Class 2     Class 1    Class 2
45/45              29         31          50         49
40/50              22         38          49         48
30/60              11         49          43         43
25/65               6         54          38         38
20/70               2         58          32         32
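
A quick consistency check on the values of Table 1 is sketched below. The data are copied from the table; the variable and function names are illustrative. The check shows that in every row the two virtual limits add up to the configured MAX_PCKS_ONQ of 60, so buffer space is transferred between the classes rather than created or lost.

```python
MAX_PCKS_ONQ = 60

# DSLOPE pair -> (class 1 limit, class 2 limit, class 1 CI, class 2 CI)
TABLE_1 = {
    "45/45": (29, 31, 50, 49),
    "40/50": (22, 38, 49, 48),
    "30/60": (11, 49, 43, 43),
    "25/65": (6, 54, 38, 38),
    "20/70": (2, 58, 32, 32),
}


def buffers_conserved(table, total):
    """True if the two virtual limits sum to `total` in every row."""
    return all(q1 + q2 == total for q1, q2, _, _ in table.values())


def max_ci_gap(table):
    """Largest difference between the two classes' loss CIs in any row."""
    return max(abs(c1 - c2) for _, _, c1, c2 in table.values())


if __name__ == "__main__":
    print(buffers_conserved(TABLE_1, MAX_PCKS_ONQ), max_ci_gap(TABLE_1))
```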



Figure 4 shows the results of one additional test, carried out using the same
methodology as in the test just presented, but now with different sensitivities
to delay degradation configured for the classes, instead of the same one. In the
test presented in Figure 4 the delay DSLOPE of class 1 was fixed at 40 degrees and 
the one of class 2 fixed at 50 degrees. As expected, the prototype still reveals the 
capacity to differentiate traffic classes, but now, coherently, providing different 
average transit delay to packets belonging to different classes. 

In order to evaluate the real influence of the queue management system on the 
router behaviour, some tests were carried out activating and deactivating it. To 
deactivate the QM system means to use the traditional tail-drop scheme, with 
maximum queue lengths of 50 packets for the queues of both classes. 

The tests follow the approach mentioned before. The sensitivity to delay
degradation was fixed: DSLOPE was set to 45 degrees throughout the test.

In the first 25-second time interval, a loss degradation slope equal to 45 was used 
in both classes, and the QM system was activated. In the second 25s interval the QM 
system was deactivated. In the third interval, loss degradation slopes equal to 30 and 
60 degrees were used respectively in class 1 and class2, and the QM system was again 
activated. In the fourth time interval it was, once more, deactivated. 

Figure 5 presents the results of this test. During the first time interval, as the 
DSLOPEs related to delay and loss were equal to 45 degrees for both classes, the 
classes received the same treatment in terms of average transit delay seen by packets, 
as well as in terms of number of packets forwarded per second. When the QM system
was deactivated the average transit delay rose substantially. This corresponds to the
growth of the maximum output queues length, and reveals an important fact about the 
efficiency of the QM system under test. In fact, with the QM system running, the 
same loss level can be achieved with a much lower average transit delay. That is, for a 
given loss level, the QM system leads to much shorter average queue lengths. 

The third and fourth time intervals clearly show the importance of the QM system.
When the classes' sensitivities to loss degradation are different, the prototype is only
able to differentiate them accordingly when the QM system is activated.

Fig. 3. Variation of average transit delay, number of packets sent, and number of
dropped packets with classes' sensitivity to loss degradation

Fig. 4. Variation of average packet delay and number of packets sent with classes'
sensitivity to loss degradation - delay DSLOPE fixed to 40/50 (class1/class2)

4.2 Test of the Packet Drop Management Module



The main goals of the tests presented in this section are to evaluate the capacity of the 
prototype to tolerate bursts and to evaluate the effectiveness of the drop strategy. The 
idea was to evaluate the improvements that can be expected from the prototype 
behaviour, when compared to a normal router behaviour. 

The general test strategy consisted of generating two packet streams (PS), one
probe PS and one load PS, and measuring the impact of the queuing and dropping
process on the probe stream.

Fig. 5. Variation of average packet delay and number of packets sent with classes' sensitivity
to loss degradation, with and without QM



The first set of tests was executed using a normal FreeBSD computer as Router,
generating packet streams on the Source1 and Source2 hosts. The normal router was a
FreeBSD v2.2.6 system running on a PC. The Mgen_m tool was used for
packet generation, for the reasons given above.

The second set of tests was executed using the router prototype, and generating the 
two packet streams on the same class. This was done in order for the tests to be 
comparable. As there is only one IP output queue when using a normal router, packet 
streams of the same class were generated to guarantee also that only one queue was 
used in the tests with the router prototype. 

The tolerance to packet bursts was evaluated by measuring the number of packets
dropped in sequence: a higher tolerance to bursts means fewer packets dropped in
sequence.
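
This measurement can be sketched as a run-length histogram over the sequence numbers of the lost probe packets. The function below is an illustration written for this text, not the authors' analysis tool; it reports, for each run length n, the percentage of all dropped packets that were lost in runs of exactly n consecutive packets.

```python
from collections import Counter


def drop_burst_histogram(lost):
    """Map run length n -> percentage of dropped packets lost in runs of
    exactly n consecutive sequence numbers. `lost` must be sorted."""
    runs = Counter()
    run_length = 0
    previous = None
    for seq in lost:
        if previous is not None and seq == previous + 1:
            run_length += 1          # the current run of drops continues
        else:
            if run_length:
                runs[run_length] += 1
            run_length = 1           # a new run starts here
        previous = seq
    if run_length:
        runs[run_length] += 1
    total = len(lost)
    return {n: 100.0 * n * count / total for n, count in sorted(runs.items())}


if __name__ == "__main__":
    print(drop_burst_histogram([3, 4, 5, 9, 12, 13]))
```

A queue manager that tolerates bursts well concentrates the mass of this histogram at n = 1.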

The effectiveness of the dropper strategy was evaluated using a metric developed 
by Koodli and Ravikanth [9]. In general terms, the intention was to measure whether 
or not the prototype was able to spread out the drop of UDP packets using the referred 
loss-distance-stream metric. This metric determines a spread factor associated with 
the packet loss rate. 

Consider a stream of packets characterised by a sequence number that starts at zero 
and increases monotonically by one. The difference between the sequence number of 
a lost packet and that of the previously lost packet is named the loss distance.
According to the metric used, a lost packet is noticeable if its loss distance is no
greater than delta, where delta is a positive integer named the loss constraint.
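
The two definitions translate directly into code. The sketch below is illustrative only: it expresses noticeable losses as a percentage of all lost packets, and it treats the first lost packet, which has no predecessor, as not noticeable. Both choices are assumptions of this example rather than statements about the metric in [9].

```python
def loss_distances(lost):
    """Differences between the sequence numbers of successive lost packets."""
    return [later - earlier for earlier, later in zip(lost, lost[1:])]


def noticeable_loss_percent(lost, delta):
    """Percentage of lost packets whose loss distance is no greater than delta."""
    if not lost:
        return 0.0
    noticeable = sum(1 for d in loss_distances(lost) if d <= delta)
    return 100.0 * noticeable / len(lost)


if __name__ == "__main__":
    lost = [3, 4, 5, 9, 12, 13]
    for delta in (1, 2, 3, 4):
        print(delta, noticeable_loss_percent(lost, delta))
```

A dropper that spreads losses keeps this percentage low for small values of delta and lets it rise only gradually as delta grows.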

The following subsections present (1) the general test conditions, (2) the tests 
carried out for evaluating the burst tolerance and (3) the test performed to evaluate the 
effectiveness of the drop strategy. 



General Test Conditions 

The tests were carried out using two different load settings, LD1 and LD2, and two
settings for traffic pattern: smooth traffic, SMT, and bursty traffic, BST. In short, the
tests were performed in the following 4 different scenarios: SMT-LD1, SMT-LD2,
BST-LD1, and BST-LD2.

The size of generated packets was fixed to 1000 bytes. The LD1 scenario was
constructed generating each of the two packet streams at a rate of 52 Mbps. In the
LD2 scenario, each packet stream was generated at 60 Mbps. SMT corresponds to
traffic that follows a uniform distribution, whereas BST corresponds to traffic that
follows a Poisson distribution (four back-to-back packets generated at exponential
time intervals). The result of each test was the log file of the probe PS on Sink,
through which the drop distribution was analysed.
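
The two traffic patterns can be sketched as lists of packet send times. This is not the Mgen_m code. It assumes that "uniform" means evenly spaced packets, that the packets of a burst share one nominal send time, and that 1 Mbps is 10^6 bit/s; all function names are hypothetical.

```python
import random


def packets_per_second(rate_bps: float, packet_bytes: int) -> float:
    """Packet rate needed to sustain rate_bps with fixed-size packets."""
    return rate_bps / (8 * packet_bytes)


def smooth_times(pps: float, n_packets: int):
    """SMT sketch: evenly spaced send times."""
    return [i / pps for i in range(n_packets)]


def bursty_times(pps: float, n_bursts: int, burst_size: int = 4, rng=None):
    """BST sketch: bursts of back-to-back packets separated by exponentially
    distributed gaps, with the same average packet rate as the smooth case."""
    rng = rng or random.Random()
    burst_rate = pps / burst_size  # bursts per second
    now = 0.0
    times = []
    for _ in range(n_bursts):
        now += rng.expovariate(burst_rate)
        times.extend([now] * burst_size)
    return times


if __name__ == "__main__":
    pps = packets_per_second(52e6, 1000)  # LD1: 6500 packets per second per stream
    print(pps, smooth_times(pps, 3), bursty_times(pps, 2, rng=random.Random(1)))
```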

For each of the referred scenarios ten tests were executed using the normal router, 
and another ten tests using the prototype router. Each test involved the generation of 
approximately 200,000 packets. The values presented below correspond to the
average of the values obtained in each test. 

Additionally, one of the scenarios was chosen (the LD2-BST scenario) and again 
two sets of ten tests were carried out, now using as probe flow a stream of IP Raw 
packets. The idea was to emulate a TCP packet stream without using any congestion 
control mechanism (and thus, to evaluate the differences of the prototype drop 
behaviour when dealing with TCP traffic instead of UDP traffic). The results of the 
tests are presented in the following sub-sections. 



Evaluation of the Tolerance to Packet Bursts 

Figures 6 through 9 show the percentage of total dropped packets in bursts of 1, 2,
3, ..., n packets.

When the router prototype is used, the histogram shifts to the
left. In other words, packets are dropped in sequences of only a few packets, in
most cases only one packet. This is evident in all the scenarios used in the tests.
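
A histogram of this kind can be derived from a list of dropped sequence numbers. The sketch below is only an illustration under the assumption that a burst is a run of consecutively numbered dropped packets; the function name is hypothetical and the format of the probe log is not modelled.

```python
from collections import Counter


def drop_burst_histogram(lost_seqs):
    """Percentage of all dropped packets that fall in bursts of length 1, 2, 3, ...

    Assumption: a burst is a maximal run of consecutive sequence numbers
    that were all dropped.
    """
    lost = sorted(set(lost_seqs))
    packets_in_bursts = Counter()   # burst length -> number of packets in such bursts
    run = 0
    prev = None
    for seq in lost:
        if prev is not None and seq == prev + 1:
            run += 1                # the current burst continues
        else:
            if run:
                packets_in_bursts[run] += run
            run = 1                 # a new burst starts
        prev = seq
    if run:
        packets_in_bursts[run] += run
    total = len(lost)
    if total == 0:
        return {}
    return {length: 100.0 * n / total
            for length, n in sorted(packets_in_bursts.items())}
```

A histogram concentrated on burst length 1 corresponds to the "shift to the left" described above.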

Thus, the prototype accommodates bursts of packets well, leading to much better
behaviour than that of a normal router using the traditional tail-drop approach
to queue management. This is apparent even in the BST scenarios, where the
traffic itself is bursty and yet the use of the prototype does not result in long
sequences of dropped packets.
Fig. 11. Noticeable Loss (scenario LD1-BST)

Fig. 12. Noticeable Loss (scenario LD2-SMT)

Fig. 14. Noticeable Loss (scenario LD2-BST)



Evaluation of the Drop Effectiveness 

Figures 10 through 13 show the evolution of the noticeable loss with the loss constraint
for the different load scenarios. The figures show that the prototype tends to spread
out the packet drops. With a normal router, almost
all the packet drops are "noticeable" when the loss constraint is 1 (in other words,
noticeable losses immediately reach a value near the maximum when the loss
constraint is only 1). With the router prototype, the percentage of noticeable
losses at a loss constraint of 1 is much lower, and it grows gradually to its
maximum value.

This happens in all the scenarios. It is, nevertheless, more evident with lower loads
(LD1) than with higher loads (LD2); in turn, it is more evident with smooth traffic
than with bursty traffic. In such conditions, the noticeable loss for a low loss
constraint falls further below the maximum noticeable loss. This
was to be expected: higher loads and bursty traffic tend to increase the drop level
and, thus, drops cannot be as "spread out" as in the other scenarios.

Figure 14 presents the results of the same type of tests as the ones presented before
(scenario LD2-BST), but now using a TCP packet stream as probe traffic. The IP raw
capability of FreeBSD was used for this purpose, generating datagrams in such a way
that they are processed by the router as if they corresponded to TCP traffic.

One of the most evident conclusions (see figure 14) is that the prototype effectively 
protects TCP traffic from UDP traffic. Noticeable losses are much lower when the 
prototype is in use. 

In short, despite the low level of drops, the prototype shows the expected 
behaviour. 
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5 Conclusions and Future Work 

Network elements are essential for the provision of predictable and controlled quality 
of service. Inside network elements, scheduling and queue management are important 
building blocks of per-hop behaviours that can lead to the desired quality of service. 

This paper presented a queue management system developed for the support of 
differentiated services. After the presentation of the queue management system 
principles and general architecture, tests to its main blocks were presented. The tests 
revealed a good capacity for class differentiation, with good performance both in 
terms of packet drops and burst tolerance, showing that it is possible to overcome the 
drawbacks that characterise the traditional approach in use in the Internet for queue 
management. 

The presented queue management system is part of a QoS-capable router prototype
being developed by the authors. Further work will be carried out in the context of this
prototype, namely the execution of further tests (scalability tests and tests with real
load patterns) and the refinement of the dropping and queue-length management algorithms.
Additionally, work addressing issues such as scheduling, QoS-aware routing and QoS
management is under way, in more complex scenarios.
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Abstract. The RED algorithm is one of the active queue manage- 
ment methods promoted by the IETF for congestion control in backbone 
routers. We analyze the dynamic behavior of a single RED controlled 
queue receiving a Poisson stream with time varying arrival rate, describ- 
ing the aggregate traffic from a population of TCP sources. The queue is 
described in terms of the time dependent expected values of the instan- 
taneous queue length and of the exponentially averaged queue length, for 
which we derive a pair of ordinary differential equations (ODEs). The 
accuracy of the model is verified for different arrival rate functions by 
comparing simulated results against the numerical solutions of the ODEs. 
For instance, the model captures well the oscillatory behavior observed in 
simulations. Also, it is possible to use the linearized version of the ODEs 
to explore bounds for the parameters of the system such that e.g. upon 
a load change the system reaches the new equilibrium gracefully with- 
out oscillations. These results, in turn, can be used as basis for some 
guidelines for choosing the parameters of the RED algorithm. 



1 Introduction 

Random Early Detection (RED) was proposed by Floyd and Jacobson [5] as an 
effective mechanism to control the congestion in the network routers/gateways. It 
also helps to prevent the global synchronization of the TCP connections sharing 
a congested buffer and to reduce the bias against bursty connections. Currently it 
is recommended as one of the mechanisms for so called active queue management 
by the IETF (see Braden et al. [11]) and it has been implemented in vendor products.

A considerable amount of research has been devoted to the performance 
analysis of the RED algorithm. Primarily the methods used have been simulation 
based and have been aimed at solving some of the deficiencies of the original RED 
algorithm. Lin and Morris [6] showed that RED is not efficient in the presence 
of non-adaptive (or “non-TCP-friendly”) flows and for that they propose a per 
flow version of RED. A similar approach has been proposed by Ziegler et al. in [14].
Feng et al. [3] propose an adaptive discarding mechanism where the maximum 
discarding probability parameter of RED is varied according to the number of 
flows present in the system. Ott et al. [9] also utilize information about the 
number of flows sharing the buffer to determine whether a packet should be 
dropped or not. Clark and Fang [2] have suggested the use of RED to provide 
different classes of services in the Differentiated Services framework. 
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Recently, some studies have been published where an analytical approach has 
been used. May et al. [7] [8] and Bonald and May [1] have developed a simplified
model of the original RED algorithm. However, RED is more complicated and 
several of its important features (e.g. the average queue length process) have 
not been captured in this model. A more detailed analysis has been presented 
in Peeters and Blondia [10]. Sharma et al. [12] analyzed the RED algorithm in 
complete detail and also developed asymptotic approximations for easing the 
computational burden of computing the performance indices of interest. 

A full description of the problem should consider a closed system consisting 
of two parts, a) a queue receiving an aggregate flow of packets and reacting 
dynamically to the changes in the total flow rate under the control of the RED 
algorithm, and b) the population of TCP sources, each of which reacts to the 
packet losses at the queue as determined by the TCP flow control mechanism 
(see Figure 1). In this paper we focus on part a) of the problem. It is reasonable 
to assume that with a large number of sources the aggregate traffic stream can 
locally be described by a Poisson process, since the phases of different sources are 
basically independent. So our task is to analyze the dynamic behavior of a RED 
controlled queue receiving a Poisson arrival stream with a time varying arrival 
rate $\lambda(t)$. Such a model can then be incorporated in a larger model describing
the interaction of parts a) and b) . 




Fig. 1. Interaction between the TCP population and the RED controlled queue 



Initial steps in the above described direction were already taken in [12], where 
the transient behavior of the system was studied in an asymptotic regime, i.e. for 
very small values of the updating parameter of the RED algorithm. Here we develop
a more general model where the arrival rate $\lambda(t)$ can be any function of
time and no asymptotics are needed. Our objective is to describe the time de- 
pendent expectations of the instantaneous queue length and of the exponentially 
weighted average (EWA) queue length. We derive a pair of ordinary differential 
equations (ODEs) for these two variables. This gives an analytic model which 
enables us to study the dynamics of the system. In our numerical examples we 
explore the behavior of the system with various arrival functions, e.g. one which 
exhibits sinusoidal variations and another one which is a step function. We verify 
by simulations that the model indeed gives accurate results. 

The response of the system to a step function is of a particular interest, as 
the characteristics of this response are often used to tune the parameters of a 
system in traditional control theory. Typically, it is required that the system 
should approach equilibrium nicely, i.e. without oscillations or too much over- 
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shooting. We derive conditions in terms of the parameters of the system such 
that oscillations do not occur. This can be used as a basis for heuristics in determining
proper value ranges for some of the RED algorithm parameters.
Floyd and Jacobson [5] have also alluded to the problem of oscillations when 
discussing the choice of the parameters of the algorithm. Our results here com- 
plement the results given in [5] and [4] , where the guidelines have been based on 
results obtained via simulations. 

The paper is organized as follows. Section 2 gives a brief description of the 
RED algorithm itself. Section 3 contains the derivation of the ODEs that govern 
the behavior of our system. In Section 4 we analyze the behavior of the system 
under a step arrival function around the equilibrium point. In Section 5 we 
present our numerical results and Section 6 contains our conclusions. 



2 The RED Algorithm 

First we briefly review the RED algorithm. Consider a queue with a finite buffer
which can hold up to $K$ packets at a time. Let $q_n$ be the queue length and $s_n$
be the EWA queue length (to be defined below) at the $n$th arrival. Also, $c_n$ is
the counter representing the time (in packet arrivals) since the previous packet
discard. Then the RED algorithm probabilistically drops packets with probability
$p_n$ in the following manner.

For each arriving packet we compute the EWA queue length $s_n$ with

$$
s_n =
\begin{cases}
(1-\beta)\, s_{n-1} + \beta\, q_n, & \text{if } q_n > 0, \\
(1-\beta)^{m}\, s_{n-1}, & \text{if } q_n = 0,
\end{cases}
\tag{1}
$$

where $m$ is the ratio of the idle time during the interarrival time of the $n$th
packet and a typical transmission time for a small packet and $0 < \beta < 1$ is
an appropriate constant. If the buffer is full, the packet is lost (i.e. $p_n = 1$)
and $c_n = 0$. If $s_n < T_{\min}$, the arriving packet is accepted (i.e. $p_n = 0$) and
$c_n = -1$. If $s_n > T_{\max}$ the packet is discarded (i.e. $p_n = 1$) and $c_n = 1$. However,
if $T_{\min} \le s_n \le T_{\max}$ we set

$$
\begin{aligned}
c_n &= c_{n-1} + 1, \\
C_n &= (T_{\max} - T_{\min}) / [\,p_{\max}(s_n - T_{\min})\,], \\
p_n &= \min[\,1,\; 1/(C_n - c_n)\,].
\end{aligned}
$$

Then, with probability $p_n$, the packet is discarded, otherwise it is accepted into
the queue.
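
The per-packet rules above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the class and method names are hypothetical, the counter values follow the description in the text, and resetting the counter to 0 after a probabilistic discard is an assumption derived from the definition of $c_n$ as the time since the previous discard.

```python
import random


class RedQueue:
    """Sketch of RED admission control following the description above."""

    def __init__(self, K, t_min, t_max, p_max, beta, rng=None):
        self.K, self.t_min, self.t_max = K, t_min, t_max
        self.p_max, self.beta = p_max, beta
        self.rng = rng or random.Random(0)
        self.q = 0      # instantaneous queue length q_n
        self.s = 0.0    # EWA queue length s_n
        self.c = -1     # counter c_n

    def early_drop_probability(self):
        """p_n = min[1, 1/(C_n - c_n)] for T_min <= s_n <= T_max."""
        if self.s <= self.t_min:          # C_n is infinite at s_n = T_min
            return 0.0
        C = (self.t_max - self.t_min) / (self.p_max * (self.s - self.t_min))
        gap = C - self.c
        # The guard also covers C_n - c_n <= 0, where the drop is certain.
        return 1.0 if gap <= 1.0 else 1.0 / gap

    def on_arrival(self, m=0.0):
        """Process one arrival; m is the idle-time ratio used when q_n = 0.
        Returns True if the packet is accepted."""
        if self.q > 0:
            self.s = (1 - self.beta) * self.s + self.beta * self.q
        else:
            self.s = (1 - self.beta) ** m * self.s
        if self.q >= self.K:              # buffer full: p_n = 1
            self.c = 0
            return False
        if self.s < self.t_min:           # p_n = 0
            self.c = -1
        elif self.s > self.t_max:         # p_n = 1
            self.c = 1
            return False
        else:
            self.c += 1                   # c_n = c_{n-1} + 1
            if self.rng.random() < self.early_drop_probability():
                self.c = 0                # assumption: the count restarts after a discard
                return False
        self.q += 1
        return True

    def on_departure(self):
        """Remove one packet from the queue, if any."""
        if self.q > 0:
            self.q -= 1
```

A simulation would call `on_arrival` at each packet arrival and `on_departure` at each service completion; the arrival and service processes themselves are outside the scope of this sketch.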

Typically, $\beta$ is chosen to be relatively small, e.g. Floyd and Jacobson [5]
propose using $\beta > 0.001$ and use $\beta = 0.002$ in their simulations. The latest
recommendations for choosing the other parameters can be found in [4], where
it is proposed that $p_{\max} \ge 0.1$, $T_{\min} \ge 5$ and $T_{\max} \ge 3 \times T_{\min}$.

Also, observe that the role of the counter $c_n$ in the RED algorithm is to
distribute the packet drops of RED more evenly. In fact, if we assume that under
steady state the incoming packets would see the same value for $C_n$, the inclusion
of the counter $c_n$ has the effect that the distance between two consecutive packet
drops has a uniform distribution in the range $[1, \ldots, C_n]$ rather than being
geometrically distributed, which would be the case if $p_n = 1/C_n$ (see [5]). The
primary reason for having the counter is then to make the flow of congestion
indications back to the TCP sources in the steady state as even as possible
to reduce the likelihood of having a burst of dropped packets resulting in an
increased possibility for synchronization among the TCP sources. However, from
the point of view of the queue behavior, the counter has little effect. Hence, we
will here consider the RED algorithm with the counter $c_n$ removed. Then the
discarding probability when $T_{\min} < s_n < T_{\max}$ depends only on the value
of the EWA queue length $s_n$, i.e.

$$
p_n = \frac{1}{C_n} = \frac{p_{\max}\,(s_n - T_{\min})}{T_{\max} - T_{\min}}.
$$

Later in the numerical examples section we will discuss the impact of this and
also give a simple correspondence to approximate the behavior of the actual
system with the counter by a counterless system.
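
The difference between the two drop patterns can be illustrated with a small simulation. The sketch below holds $C_n$ fixed at a hypothetical value `C`, applies the counter rule exactly as written above, and records the number of arrivals between consecutive drops; the function name and the chosen numbers are assumptions for illustration only.

```python
import random


def inter_drop_gaps(C, n_drops, use_counter, seed=1):
    """Distances (in arrivals) between consecutive drops for a constant C_n = C."""
    rng = random.Random(seed)
    gaps = []
    c = 0                                    # arrivals since the previous drop
    while len(gaps) < n_drops:
        c += 1                               # c_n = c_{n-1} + 1
        if use_counter:
            gap = C - c
            p = 1.0 if gap <= 1.0 else 1.0 / gap   # min[1, 1/(C_n - c_n)]
        else:
            p = 1.0 / C                      # counterless RED: p_n = 1/C_n
        if rng.random() < p:
            gaps.append(c)
            c = 0                            # restart the count after a drop
    return gaps
```

With the counter, the distances are bounded by `C` and spread evenly below that bound; without it, they follow a geometric distribution with mean `C` and an unbounded tail, so occasional closely spaced drops and long drop-free stretches both occur.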

3 The ODE Approximation 

In the following, in accordance with the discussion in the introduction, we assume
that the packet arrival stream constitutes a Poisson process with a time varying
arrival rate $\lambda(t)$ and that the packet lengths are exponentially distributed with
mean $1/\mu$. Also, for the sake of simplicity, we will not take into account the
special case of computing $s_n$ differently when the arrival occurs into an empty
queue. The effect of this special handling is less important for heavily loaded
queues which are our primary interest here.

Our aim now is to develop a model for the transient behavior of the system in
continuous time and for that we first look at the development of $s_n$. The discrete
time equation driving this is obtained from (1),

$$
s_{n+1} - s_n = \beta\,(q_{n+1} - s_n). \tag{2}
$$

Eq. (2) relates the values of $s_{n+1}$ and $s_n$ to each other at the arrival instants
of the packets. However, we can also study the EWA queue length process in
continuous time, i.e. the process $s(t)$. Consider a short interval of length $\Delta t$.
Then by taking expectations of (2) and conditioning on the number of arrival
events $A(\Delta t)$ occurring in the interval $(t, t + \Delta t)$ we obtain

$$
\begin{aligned}
\mathrm{E}[\Delta s(t)] ={}& \mathrm{E}[\Delta s(t) \mid A(\Delta t) = 1]\, P\{A(\Delta t) = 1\} \\
&+ \mathrm{E}[\Delta s(t) \mid A(\Delta t) = 0]\, P\{A(\Delta t) = 0\} + o(\Delta t),
\end{aligned}
$$

from which

$$
\begin{aligned}
\Delta\bigl[\mathrm{E}[s(t)]\bigr] &= \beta\, \mathrm{E}[q(t) - s(t)]\, \lambda(t)\, \Delta t + o(\Delta t) \\
&= \lambda(t)\, \beta\, \bigl(\mathrm{E}[q(t)] - \mathrm{E}[s(t)]\bigr)\, \Delta t + o(\Delta t).
\end{aligned}
$$

Then, by denoting $\bar{s}(t) = \mathrm{E}[s(t)]$ and $\bar{q}(t) = \mathrm{E}[q(t)]$ and taking the limit $\Delta t \to 0$
we obtain the ODE

$$
\frac{d}{dt}\bar{s}(t) = \lambda(t)\, \beta\, \bigl(\bar{q}(t) - \bar{s}(t)\bigr). \tag{3}
$$

A similar equation was also derived in [12] as an asymptotic result pertaining to
the process $s(t)$ itself (as opposed to the expectation $\bar{s}(t)$ considered here). No
approximations are involved in eq. (3).

The exact dynamics of the term $\bar{q}(t)$ in (3) are unknown. To overcome this,
in [12] a quasi-stationarity approximation, which is exact in the limit $\beta \to 0$,
was made based on the fact that in practice we are interested in evaluating (3)
for systems where $\beta$ is very small (a negative power of ten). Then $s(t)$ changes
very slowly compared to $q(t)$ and, hence, it can be assumed that the term $\bar{q}(t)$ is
sufficiently well approximated by the mean stationary queue length of a hypothetical
queue length process where the current value of the EWA queue length $\bar{s}(t)$ controlling
the access to the queue is fixed. To be precise, consider the stationary queue
length $q^{(s)}$ as a function of the access control parameter $s$, which is constant.
Our approximation amounts to $\bar{q}(t) = \mathrm{E}[q^{(s)}]\big|_{s=\bar{s}(t)}$, i.e. the time dependence
is introduced parametrically to the stationary distribution. However, the simulation
results in [12] showed that even for such small values of $\beta$ the finite time it takes
for the queue length process to reach stationarity has a noticeable effect on the
queue dynamics.

Thus, we will here analyze in more detail the expectation of the instantaneous
queue length $\bar{q}(t)$, and we introduce another ODE for $\bar{q}(t)$. Therefore, we again
look at a short interval of time $\Delta t$, condition on having 0, 1 or more arrivals and
take the limit $\Delta t \to 0$. With some additional assumptions we obtain the result

$$
\frac{d}{dt}\bar{q}(t) = \lambda(t)\,\bigl(1 - \pi_K(t \mid \bar{q}(t))\bigr)\,\bigl(1 - p(\bar{s}(t))\bigr) - \mu\,\bigl(1 - \pi_0(t \mid \bar{q}(t))\bigr), \tag{4}
$$

where the first term is the expected rate at which packets are admitted to the
queue and the second term is the expected rate at which packets leave the system
and $p(s)$ is the rejection probability

$$
p(s) = p_{\max}\, \frac{s - T_{\min}}{T_{\max} - T_{\min}}.
$$

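In order to define the quantities $\pi_0(t \mid \bar{q}(t))$ and $\pi_K(t \mid \bar{q}(t))$ in (4) and to see the
nature of the approximations involved, let us examine more closely e.g. the first
term in (4). Aside from the factor $\lambda(t)$, the other factors represent the following
expected acceptance probability $P_{\mathrm{acc}}(t)$ of a packet at time $t$:

$$
P_{\mathrm{acc}}(t) = \mathrm{E}\bigl[\mathbf{1}_{q(t)<K} \cdot R(t) \mid \bar{s}(t), \bar{q}(t)\bigr],
$$

where $R(t)$ denotes the Bernoulli variable describing whether RED will accept
the packet or not. The conditioning comes from the fact that $\bar{s}(t)$ and $\bar{q}(t)$ are
known by the ODEs (though with some approximations).

The expectation can be evaluated as follows by further conditioning on $s(t)$:

$$
P_{\mathrm{acc}}(t) = \mathrm{E}\bigl[\,\mathrm{E}[\mathbf{1}_{q(t)<K} \cdot R(t) \mid s(t)] \mid \bar{s}(t), \bar{q}(t)\bigr].
$$

Now, given $s(t)$, the random variable $R(t)$ is independent of $q(t)$ and its
expectation is just $(1 - p(s(t)))$. Thus

$$
P_{\mathrm{acc}}(t) = \mathrm{E}\bigl[(1 - p(s(t)))\, \mathrm{E}[\mathbf{1}_{q(t)<K} \mid s(t)] \mid \bar{s}(t), \bar{q}(t)\bigr].
$$

It is reasonable to assume that, given $\bar{q}(t)$, $q(t)$ does not depend appreciably
either on $s(t)$ or $\bar{s}(t)$; similarly, given $\bar{s}(t)$, $s(t)$ depends almost solely on $\bar{s}(t)$ and
not on $\bar{q}(t)$. Thus, noting also that $p(\cdot)$ is a linear function of its argument, we
arrive at

$$
P_{\mathrm{acc}}(t) = \bigl(1 - p(\bar{s}(t))\bigr)\,\bigl(1 - \pi_K(t \mid \bar{q}(t))\bigr),
$$

where $\pi_K(t \mid \bar{q}) = P\{q(t) = K \mid \bar{q}\}$. In a similar way we obtain the second term
in (4) with $\pi_0(t \mid \bar{q}) = P\{q(t) = 0 \mid \bar{q}\}$.

Now, the exact dynamics of $\pi_0(t \mid \bar{q}(t))$ and $\pi_K(t \mid \bar{q}(t))$ are unknown to us,
but we can again make a quasi-stationarity assumption. This amounts to that
e.g. $\pi_0(t \mid \bar{q}(t))$ is equal to $\pi_0 = \pi_0(\bar{q})\big|_{\bar{q}=\bar{q}(t)}$ obtained from the first pair of
the stationary M/M/1/K equations (one pair per row)

$$
\begin{aligned}
\pi_0 &= \frac{1-\rho}{1-\rho^{K+1}}, &\qquad \bar{q} &= \frac{\rho}{1-\rho} - \frac{(K+1)\,\rho^{K+1}}{1-\rho^{K+1}}, \\
\pi_K &= \frac{(1-\rho)\,\rho^{K}}{1-\rho^{K+1}}, &\qquad \bar{q} &= \frac{\rho}{1-\rho} - \frac{(K+1)\,\rho^{K+1}}{1-\rho^{K+1}},
\end{aligned}
$$

by eliminating $\rho$ (yielding $\pi_0(\bar{q})$) and evaluating the result at $\bar{q} = \bar{q}(t)$. Similarly,
$\pi_K(t \mid \bar{q}(t))$ is obtained from the second pair of equations. This approximation
technique is similar to the one used in [13]. To summarize, our system of ODEs
from (3) and (4), suppressing the explicit dependence on time, is

$$
\begin{aligned}
\frac{d}{dt}\bar{s} &= \lambda\beta\,(\bar{q} - \bar{s}), \\
\frac{d}{dt}\bar{q} &= \lambda\,(1 - \pi_K(\bar{q}))\,(1 - p(\bar{s})) - \mu\,(1 - \pi_0(\bar{q})).
\end{aligned}
\tag{5}
$$

The elimination of $\rho$ in the above can be done numerically on the fly while
integrating the ODE system.

However, if we assume the buffer to be of infinite size, i.e. we let $K \to \infty$,
there is no overflow in the queue and the ODE becomes

$$
\begin{aligned}
\frac{d}{dt}\bar{s} &= \lambda\beta\,(\bar{q} - \bar{s}), \\
\frac{d}{dt}\bar{q} &= \lambda\,(1 - p(\bar{s})) - \mu\,(1 - \pi_0(\bar{q})).
\end{aligned}
$$

To approximate $\pi_0$ we use the same quasi-stationarity heuristics as before. In
this case, $\pi_0(\bar{q})$ is obtained by eliminating $\rho$ from

$$
\pi_0 = 1 - \rho, \qquad \bar{q} = \frac{\rho}{1-\rho},
$$

yielding the result $\pi_0(\bar{q}) = 1/(1 + \bar{q})$. Thus, we have an explicit form for the
ODE system

$$
\begin{aligned}
\frac{d}{dt}\bar{s} &= \lambda\beta\,(\bar{q} - \bar{s}), \\
\frac{d}{dt}\bar{q} &= \lambda\left(1 - p_{\max}\,\frac{\bar{s} - T_{\min}}{T_{\max} - T_{\min}}\right) - \mu\,\frac{\bar{q}}{1 + \bar{q}}.
\end{aligned}
\tag{6}
$$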
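
The numerical elimination of $\rho$ mentioned for the finite-buffer system can be sketched as follows. The sketch assumes that the hypothetical stationary queue is a standard M/M/1/K queue with load $\rho$, computes its stationary distribution by direct summation (to avoid cancellation near $\rho = 1$), and inverts the mean queue length by bisection. The function names are hypothetical.

```python
def mm1k_stats(rho, K):
    """Return (pi_0, pi_K, mean queue length) of a stationary M/M/1/K queue."""
    # Unnormalised weights rho**k, rescaled for rho > 1 to avoid overflow.
    if rho <= 1.0:
        w = [rho ** k for k in range(K + 1)]
    else:
        w = [rho ** (k - K) for k in range(K + 1)]
    total = sum(w)
    mean = sum(k * wk for k, wk in enumerate(w)) / total
    return w[0] / total, w[K] / total, mean


def eliminate_rho(q_bar, K, iterations=100):
    """Find the load rho whose mean queue length equals q_bar (by bisection)
    and return the corresponding (pi_0, pi_K)."""
    if not 0.0 <= q_bar < K:
        raise ValueError("q_bar must lie in [0, K)")
    lo, hi = 0.0, 1.0
    while mm1k_stats(hi, K)[2] < q_bar:     # the mean increases with rho
        hi *= 2.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if mm1k_stats(mid, K)[2] < q_bar:
            lo = mid
        else:
            hi = mid
    pi_0, pi_K, _ = mm1k_stats(0.5 * (lo + hi), K)
    return pi_0, pi_K
```

For a very large buffer the value returned for $\pi_0$ approaches the closed form $1/(1 + \bar{q})$ of the infinite-buffer case, which gives a simple consistency check. In an ODE solver, `eliminate_rho` would be called with the current value of $\bar{q}$ at every evaluation of the right-hand side of (5).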



4 Equilibrium and Stability of the System 

Having now established ODE systems, which describe the mean behavior of the 
RED controlled queue under a time varying load, we can apply similar concepts 
as are used in the traditional control theory of deterministic systems. Therein 
one important consideration is the so called step response, i.e. the response of 
the system when the input changes as a step function. Then it is often required 
that the step response reaches equilibrium nicely, i.e. that the trajectories reach 
equilibrium without any oscillations or that the oscillations are dampened suffi- 
ciently quickly. 
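
To make the notion of a step response concrete, the following sketch integrates the explicit infinite-buffer system (6) from an empty system with a constant arrival rate, using a fixed-step fourth-order Runge-Kutta scheme. The function names, the step size and the default parameter values are illustrative assumptions, and clipping $p(s)$ to $[0, 1]$ outside the RED controlled range is also an assumption of the sketch.

```python
def red_ode_rhs(s, q, lam, mu, beta, p_max, t_min, t_max):
    """Right-hand side of the infinite-buffer system (6)."""
    # Assumption: p(s) is clipped to [0, 1] outside the RED controlled range.
    p = min(1.0, max(0.0, p_max * (s - t_min) / (t_max - t_min)))
    ds = lam * beta * (q - s)
    dq = lam * (1.0 - p) - mu * q / (1.0 + q)
    return ds, dq


def step_response(lam, t_end, dt=0.5, mu=1.0, beta=0.002,
                  p_max=0.1, t_min=5.0, t_max=20.0):
    """Integrate (6) from s = q = 0 with a constant arrival rate lam (RK4).
    Returns a list of (t, s, q) tuples."""
    def f(s, q):
        return red_ode_rhs(s, q, lam, mu, beta, p_max, t_min, t_max)

    s = q = 0.0
    path = [(0.0, s, q)]
    for i in range(1, int(round(t_end / dt)) + 1):
        k1 = f(s, q)
        k2 = f(s + 0.5 * dt * k1[0], q + 0.5 * dt * k1[1])
        k3 = f(s + 0.5 * dt * k2[0], q + 0.5 * dt * k2[1])
        k4 = f(s + dt * k3[0], q + dt * k3[1])
        s += dt * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]) / 6.0
        q += dt * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]) / 6.0
        path.append((i * dt, s, q))
    return path
```

With these illustrative values and `lam = 1.0`, the queue length first overshoots and then settles with damped oscillations, which is the kind of behavior the criteria discussed in this section are meant to characterize.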

Thus, we will here show how such concepts could be applied in our case. We
first study the equilibrium solutions, so called attractors, of (5) and (6) when
$\lambda(t)$ is a step function

$$
\lambda(t) =
\begin{cases}
0, & t < 0, \\
\lambda, & t \ge 0.
\end{cases}
$$

Also it is assumed that the system state is $\bar{q} = \bar{s} = 0$ at time 0. This enables
us to identify what we call an effective control region for the algorithm and to
derive bounds for the parameters of the system such that the system stays in
this region. Having established the attractor of the ODE in the effective control
region, we can linearize the ODE around that point to get further insight into
how the system approaches the equilibrium, i.e. whether the trajectories of the
ODE system oscillate before reaching the equilibrium and, if they do, how the
oscillations are damped. This provides us with a means to establish criteria on the
properties of the linearized system e.g. by requiring that the solutions of the
ODE are not oscillatory or that the oscillations are damped rapidly enough.

In general, the attractor is obtained by equating the right hand side of the
ODE system, either (5) or (6), to 0. Now, let the general form of our ODE be
written as

$$
\begin{aligned}
\frac{d}{dt}\bar{s} &= f_1(\bar{s}, \bar{q}), \\
\frac{d}{dt}\bar{q} &= f_2(\bar{s}, \bar{q}),
\end{aligned}
\tag{7}
$$

where always $f_1(s, q) = \lambda\beta\,(q - s)$ and the form of $f_2(s, q)$ depends on whether
the buffer length is assumed to be finite or infinite. The attractor $(s^*, q^*)$ is
obtained as the solution to the system of equations

$$
f_1(s^*, q^*) = 0, \qquad f_2(s^*, q^*) = 0.
$$

From the first equation it immediately follows that $s^* = q^*$. Finally, $q^*$ is
determined by the equation

$$
f_2(q^*, q^*) = 0. \tag{8}
$$

It is also useful to know for which values of the load $\rho$ the attractor lies in the
RED controlled range $T_{\min} < q^* < T_{\max}$. The values of the load corresponding
to the boundaries of this range can be solved from (8) by setting $q^* = T_{\min}$
and $q^* = T_{\max}$, respectively. Therefore, we say here that the algorithm is in the
effective control region when $T_{\min} < q^* < T_{\max}$.

Now, we can study how the equilibrium is reached by linearizing the ODE
system near the attractor $(q^*, q^*)$. To this end, let $\Delta s = \bar{s} - q^*$ and $\Delta q = \bar{q} - q^*$
be the deviations of $\bar{s}$ and $\bar{q}$ from the attractor. Consider the state variable (in
vector notation) $\mathbf{x} = (\Delta s, \Delta q)^{\mathrm{T}}$. Close to the attractor the time derivative of $\mathbf{x}$
is obtained from the linearized version of (7),

$$
\frac{d}{dt}\mathbf{x} = \mathbf{J}(q^*)\,\mathbf{x}, \tag{9}
$$

where $\mathbf{J}(q^*)$ is the Jacobian evaluated at the attractor,

$$
\mathbf{J}(q^*) =
\left.
\begin{pmatrix}
\dfrac{\partial f_1(s,q)}{\partial s} & \dfrac{\partial f_1(s,q)}{\partial q} \\[2ex]
\dfrac{\partial f_2(s,q)}{\partial s} & \dfrac{\partial f_2(s,q)}{\partial q}
\end{pmatrix}
\right|_{s=q^*,\; q=q^*}.
\tag{10}
$$

The general solution of (9) is

$$
\mathbf{x}(t) = \alpha_1 \mathbf{v}_1 e^{z_1 t} + \alpha_2 \mathbf{v}_2 e^{z_2 t},
$$

where $z_i$ and $\mathbf{v}_i$, $i = 1, 2$, are the eigenvalues and eigenvectors of $\mathbf{J}(q^*)$,
i.e. $\mathbf{J}(q^*)\mathbf{v}_i = z_i \mathbf{v}_i$, and $\alpha_i$, $i = 1, 2$, are constants.

Note that the eigenvalues are in this case obtained as a solution to a quadratic
equation and, hence, both eigenvalues are at the same time either real or complex.
Therefore, once we know the eigenvalues of $\mathbf{J}(q^*)$, we can determine whether
the linearized system approaches the attractor $q^*$ with decaying oscillations
around the attractor (complex eigenvalues) or smoothly without any oscillations
(real eigenvalues). Then we can also define requirements that the eigenvalues
should meet.

A nicely behaving system would be one where oscillations do not appear, in
which case we require that $\Im(z_i) = 0$, where $\Im(\cdot)$ denotes the imaginary part.
This translates to a requirement for the discriminant of the eigenvalue equation.
By using the following short hand notation for the elements of $\mathbf{J}(q^*)$,

$$
\mathbf{J}(q^*) =
\begin{pmatrix}
a_1 & a_2 \\
a_3 & a_4
\end{pmatrix},
$$

the eigenvalue equation reads

$$
z^2 - (a_1 + a_4)\,z + a_1 a_4 - a_2 a_3 = 0,
$$

and the requirement $\Im(z_i) = 0$ means that the discriminant of this quadratic
equation must be greater than or equal to zero,

$$
(a_1 - a_4)^2 + 4 a_2 a_3 \ge 0. \tag{11}
$$

Thus we have obtained a requirement in terms of the parameters of the system
such that the system meets our specified criteria (no oscillations). An alternative
way to define the criteria would be to allow for some oscillations in the system
but require that the amplitude of the oscillations decays by some predetermined
factor for each full oscillation cycle.
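
Both criteria can be evaluated directly from the four elements of the Jacobian. The sketch below, with a hypothetical function name, checks the sign of the discriminant in (11) and, for complex eigenvalues $z = a \pm ib$, computes the factor $e^{2\pi a/|b|}$ by which the amplitude changes over one full oscillation cycle.

```python
import cmath
import math


def classify_response(a1, a2, a3, a4):
    """Classify the linearized 2x2 system dx/dt = [[a1, a2], [a3, a4]] x."""
    trace = a1 + a4
    disc = (a1 - a4) ** 2 + 4.0 * a2 * a3      # left-hand side of (11)
    root = cmath.sqrt(disc)
    z1 = (trace + root) / 2.0
    z2 = (trace - root) / 2.0
    if disc >= 0.0:
        # Real eigenvalues: no oscillations around the attractor.
        return {"oscillatory": False,
                "eigenvalues": (z1.real, z2.real),
                "decay_per_cycle": None}
    # Complex pair a +/- ib: one full cycle lasts 2*pi/|b| time units.
    period = 2.0 * math.pi / abs(z1.imag)
    return {"oscillatory": True,
            "eigenvalues": (z1, z2),
            "decay_per_cycle": math.exp(z1.real * period)}
```

A value of `decay_per_cycle` below a chosen threshold, for instance 0.1, would express the alternative criterion that the oscillation amplitude shrinks by at least that factor in every cycle; the threshold itself is a design choice not fixed by the text.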



4.1 The Case of Infinite Buffer Size 



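In the case of a finite buffer, the equations of the previous section can only
be solved numerically. However, if we assume the buffer size to be infinite the
solution can be obtained in a closed form. The ODE system in this case is given
by (6). The equation for the attractor (8) reads in this case

$$
\lambda\left(1 - p_{\max}\,\frac{q^* - T_{\min}}{T_{\max} - T_{\min}}\right) - \mu\,\frac{q^*}{1 + q^*} = 0, \tag{12}
$$

yielding the result

$$
q^* = \frac{-d \pm \sqrt{4\lambda^2 p_{\max}\,(T_{\max} - T_{\min} + p_{\max} T_{\min}) + d^2}}{2\lambda p_{\max}},
$$

where $d = \lambda p_{\max} - T_{\max}(\lambda - \mu) + T_{\min}(\lambda - \mu - \lambda p_{\max})$.

The bounds on the load $\rho = \lambda/\mu$ such that the equilibrium lies in the stable
region $T_{\min} < q^* < T_{\max}$ can also be computed. From (12) we obtain for $q^* = T_{\min}$

$$
\lambda - \mu\,\frac{T_{\min}}{1 + T_{\min}} = 0
\quad\Rightarrow\quad
\rho_{\min} = \frac{T_{\min}}{1 + T_{\min}}, \tag{13}
$$

and for $q^* = T_{\max}$

$$
\lambda\,(1 - p_{\max}) - \mu\,\frac{T_{\max}}{1 + T_{\max}} = 0
\quad\Rightarrow\quad
\rho_{\max} = \frac{1}{1 - p_{\max}}\,\frac{T_{\max}}{1 + T_{\max}}. \tag{14}
$$

Additionally, the requirement that the solution is not oscillatory can be
formulated in closed form. To this end, we first compute the Jacobian $\mathbf{J}(q^*)$ using
eqs. (6) and (10),

$$
\mathbf{J}(q^*) =
\begin{pmatrix}
-\lambda\beta & \lambda\beta \\[1ex]
-\dfrac{\lambda p_{\max}}{T_{\max} - T_{\min}} & -\dfrac{\mu}{(1 + q^*)^2}
\end{pmatrix}.
$$

Inserting the elements of the above in (11) we obtain the condition for the
discriminant, $D$,

$$
D = \left(-\lambda\beta + \frac{\mu}{(1 + q^*)^2}\right)^{2} - \frac{4\lambda^2 \beta\, p_{\max}}{T_{\max} - T_{\min}} \ge 0. \tag{15}
$$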
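
The closed-form results for the infinite-buffer case are easy to evaluate numerically. The following sketch computes the attractor from (12) (taking the positive root), the load bounds (13) and (14), and the discriminant (15); the function names are hypothetical and the parameter values in the usage note are chosen only for illustration.

```python
import math


def attractor_infinite_buffer(lam, mu, p_max, t_min, t_max):
    """Positive root q* of the attractor equation (12)."""
    d = lam * p_max - t_max * (lam - mu) + t_min * (lam - mu - lam * p_max)
    root = math.sqrt(4.0 * lam ** 2 * p_max * (t_max - t_min + p_max * t_min) + d ** 2)
    return (-d + root) / (2.0 * lam * p_max)


def load_bounds(p_max, t_min, t_max):
    """Loads rho_min (13) and rho_max (14) bounding the effective control region.
    Requires p_max < 1."""
    rho_min = t_min / (1.0 + t_min)
    rho_max = t_max / ((1.0 + t_max) * (1.0 - p_max))
    return rho_min, rho_max


def discriminant(lam, mu, beta, p_max, t_min, t_max):
    """Discriminant D of (15); D >= 0 means no oscillations."""
    q_star = attractor_infinite_buffer(lam, mu, p_max, t_min, t_max)
    return ((-lam * beta + mu / (1.0 + q_star) ** 2) ** 2
            - 4.0 * lam ** 2 * beta * p_max / (t_max - t_min))
```

For example, with $\lambda = \mu = 1$, $p_{\max} = 0.1$, $T_{\min} = 5$ and $T_{\max} = 20$ the attractor is $q^* = 2 + \sqrt{159} \approx 14.61$, the load bounds are approximately 0.833 and 1.058, and for $\beta = 0.002$ the discriminant is negative, so the linearized system oscillates.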
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5 Numerical Studies 

Here we examine via numerical examples the results obtained in Sections 3 and
4. We will compare the numerical results given by our models against simulated
results. Throughout the examples, unless stated otherwise, we will use $\beta = 0.002$,
as has been suggested in the literature. Also, in all the examples here $\mu = 1$,
i.e. the unit of time is taken to be $1/\mu$, and the initial condition for transient
cases is such that the system is empty at time $t = 0$. The simulation results have
been taken as an average over 5000 realizations.



5.1 Validation of the ODEs 

First we examine the accuracy of the step response of our model for a system
controlled by RED without the counter. We start by considering ODE system
(5) for the finite buffer case and look at an example with $K = 100$, $T_{\min} = 20$,
$T_{\max} = 40$ and $p_{\max} = 0.4$. The input load $\rho(t)$ of the system rises from 0
to $\rho = 1.2$ at time 0, i.e. a heavy load. Figure 2 shows the solution of the ODE
(solid curves) and the simulation results (dashed curves) for both $\bar{s}(t)$ and $\bar{q}(t)$.
As we can see, the oscillations are clearly visible in the real simulated system and
they are fairly well captured by the solution of the ODE.





Fig. 2. The solution of ODE (5) (solid curves) and simulation results (dashed
curves). The figure on the left depicts $\bar{s}(t)$ and the figure on the right shows $\bar{q}(t)$



Then we experiment with a time varying load such that $\rho(t) = 1.2 + 0.2 \sin(2\pi t/T)$,
where $T = 1000$. Other parameters were as previously. The results
are shown in Figure 3. The figure on the left depicts $\rho(t)$, the middle figure
shows $\bar{s}(t)$ and $\bar{q}(t)$ is shown in the right figure. Again, simulated results are shown
with dashed curves and the solutions to the ODE correspond to the solid curves.
The results show that the ODEs are able to capture the behavior of the system
even under a time varying load. A recurring observation one can make from all
the presented figures is that the numerical solutions from our model appear to be
slightly "ahead" of the corresponding simulated results. This is due to the fact
that a quasi-stationarity assumption was used to obtain the empty queue and
full queue probabilities in the ODE for $\bar{q}(t)$. The approximation causes the ODE
system to react to the dynamic changes more "promptly" than what happens in
the actual system.






































































Fig. 3. The solution of ODE (5) (solid curves) and simulation results (dashed
curves). The figure on the left depicts $\rho(t) = \lambda(t)$, the figure in the middle is $\bar{s}(t)$
and the figure on the right shows $\bar{q}(t)$



5.2 Approximating the Effect of the Counter 



Here we show that the effect of the counter on the behavior of the queue can be
approximately taken into account in a counterless system by modifying the $p(s)$
function. To this end, consider the number of packet arrivals, $X$, between two
consecutive packet drops. As was shown in [5], assuming $s$ to be constant or very
slowly varying, $X$ is uniformly distributed in the range $[1, \ldots, 1/p_d]$, where
$p_d = p_{\max}(s - T_{\min})/(T_{\max} - T_{\min})$. The expectation of $X$ is then
$\mathrm{E}[X] = 1/(2p_d) + 1/2$. If $X$ is geometrically distributed with parameter $p_d$,
as we have assumed thus far, this expectation is simply $\mathrm{E}[X] = 1/p_d$. Hence, we
can obtain a counterless system approximating the real system including the
counter by choosing $p_d'$ to satisfy

$$
\frac{1}{2p_d} + \frac{1}{2} = \frac{1}{p_d'}
\quad\Rightarrow\quad
p_d' = \frac{2p_d}{1 + p_d}.
$$

To check the accuracy of the approximation, we consider the same example
as earlier, but now the simulated system is the RED algorithm with the counter
included and in the ODE system (5) we use the above heuristics, in which case
the discarding probability function is given by

$$
p(s) = \frac{2p_{\max}\,(s - T_{\min})}{T_{\max} - T_{\min} + p_{\max}\,(s - T_{\min})}.
$$

Figure 4 shows the solution of the ODE (solid curves) and the simulated results
(dashed curves) for both $\bar{s}(t)$ and $\bar{q}(t)$. Again, we can observe how the approximation
for the counter in the ODE captures well the behavior of the real simulated system.



5.3 Tests on the Linearized System 

Fig. 4. The solution of the ODE (5) (solid curves) with the approximation for the
counter and simulated results (dashed curves) for the complete RED algorithm.
The figure on the left depicts $\bar{s}(t)$ and the figure on the right shows $\bar{q}(t)$

Here we experiment with the infinite buffer system obeying ODE (6) and the use
of the linearized system and its Jacobian $\mathbf{J}(q^*)$ to determine the parameters such
that the step response of the system does not have oscillations, given by (15).
First we study the effect of $\rho$, $p_{\max}$ and $\beta$ on the requirement (15), when $T_{\min}$
and $T_{\max}$ are fixed. As an example, we choose $T_{\min} = 5$ and $T_{\max} = 20$ (as has
been suggested in [4]) and solve (15) as a strict equality,

$$
\left(-\rho\beta + \frac{1}{(1 + q^*)^2}\right)^{2} = \frac{4\rho^2 \beta\, p_{\max}}{15}, \tag{16}
$$



for $\beta$ as a function of $p_{\max}$ for a given value of $\rho$. We vary $p_{\max}$ from 0.1
upwards, in which case the attractor $q^*$ is in the effective control region if
the load is in the range $[0.833, \ldots, 1.058]$. Note that for a fixed value of load
and $p_{\max}$ ($q^*$ does not depend on $\beta$), (16) reduces to a quadratic equation in $\beta$
which has two solutions, in this case both real and positive. However, we
are mainly interested in the smaller values (the greater values are impractically
large). The results are shown in Figure 5.
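
The two solutions of (16) can be computed as follows. The sketch takes $\mu = 1$ as in the examples, so that $\lambda = \rho$, obtains $q^*$ from the attractor equation (12), and then solves the quadratic in $\beta$ obtained by expanding (16). The function names are hypothetical.

```python
import math


def attractor(rho, p_max, t_min=5.0, t_max=20.0):
    """Positive root q* of (12) with mu = 1, i.e. lambda = rho."""
    d = rho * p_max - t_max * (rho - 1.0) + t_min * (rho - 1.0 - rho * p_max)
    root = math.sqrt(4.0 * rho ** 2 * p_max * (t_max - t_min + p_max * t_min) + d ** 2)
    return (-d + root) / (2.0 * rho * p_max)


def beta_limits(rho, p_max, t_min=5.0, t_max=20.0):
    """Return the (smaller, larger) solution beta of (16), for which D = 0."""
    c = 1.0 / (1.0 + attractor(rho, p_max, t_min, t_max)) ** 2
    # Expanding (c - rho*beta)^2 = 4 rho^2 beta p_max / (T_max - T_min) gives
    # rho^2 beta^2 - b beta + c^2 = 0 with the coefficient b below.
    b = 2.0 * rho * c + 4.0 * rho ** 2 * p_max / (t_max - t_min)
    root = math.sqrt(b * b - 4.0 * rho ** 2 * c * c)
    return (b - root) / (2.0 * rho ** 2), (b + root) / (2.0 * rho ** 2)
```

Since the discriminant $D$ is a quadratic in $\beta$ with a positive leading coefficient, it is non-negative for $\beta$ below the smaller solution or above the larger one; the smaller solution is therefore the practically relevant upper limit on $\beta$ for a non-oscillatory step response.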




Fig. 5. The allowable regions for $(p_{\max}, \beta)$ for different values of $\rho$ (horizontal
axis: $p_{\max}$; the unit of the y-axis is $1000 \cdot \beta$)



Then we illustrate the importance of a particular combination of RED parameters,
namely $p_{max}/(T_{max} - T_{min})$, for the performance. Assume that we fix
$\beta = 0.002$, $T_{min} = 5$, $p_{max} = 0.2$ (as is suggested in the literature). Then we
plot the value of the discriminant of eq. (15) (denoted by $D$) as a function of $T_{max}$
for different load parameters $\rho = [0.85, 0.90, 0.95, 1.0]$. This is shown in Figure 6
(left figure), where the highest curve corresponds to load $\rho = 0.85$ and
the lowest curve to load $\rho = 1.0$. As we can see, the value of the
discriminant is negative when $T_{max}$ is close to the value of $T_{min}$, indicating that
the system has oscillations. This reflects the fact that, with such a large value of
$p_{max}/(T_{max} - T_{min})$, RED tries to control the queue too aggressively. However,
as the value of $T_{max}$ is increased, the value of the discriminant becomes positive
or becomes very close to 0 (although staying negative). We repeated the
same computations for loads $\rho = [0.95, 1.2, 1.4]$, but doubled the values of $T_{min}$
and $p_{max}$, i.e., we set $T_{min} = 10$ and $p_{max} = 0.4$. The results are shown
in Figure 6 (right figure), where the highest curve corresponds to the lowest
load (0.95) and the lowest curve to the highest load (1.4). The results are similar to
the earlier ones, but this time we can see that we are not able to make the system
completely non-oscillatory.
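The shape of these curves can be reproduced approximately with a short script. The sketch
below rests on two assumptions that are ours, not statements of the text above: that the
equilibrium queue length $q^*$ solves $\rho(1 - p(q^*)) = q^*/(1+q^*)$ with the linear RED
drop function $p(q) = p_{max}(q - T_{min})/(T_{max} - T_{min})$ and unit service rate, and that
the discriminant is $D = (\rho\beta - 1/(1+q^*)^2)^2 - 4\rho^2\beta p_{max}/(T_{max} - T_{min})$.
All function names are hypothetical.

```python
def equilibrium_queue(rho, p_max, t_min, t_max, tol=1e-12):
    """Bisection for q* in [t_min, t_max] under the assumed equilibrium condition."""
    def g(q):
        drop = p_max * (q - t_min) / (t_max - t_min)
        return rho * (1.0 - drop) - q / (1.0 + q)   # strictly decreasing in q

    lo, hi = t_min, t_max
    if g(lo) < 0.0 or g(hi) > 0.0:
        raise ValueError("q* lies outside the effective control region")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def discriminant(rho, beta, p_max, t_min, t_max):
    """Assumed discriminant D; negative values indicate an oscillatory step response."""
    q = equilibrium_queue(rho, p_max, t_min, t_max)
    return (rho * beta - 1.0 / (1.0 + q) ** 2) ** 2 \
        - 4.0 * rho ** 2 * beta * p_max / (t_max - t_min)

for t_max in (6.0, 20.0, 60.0):
    print(t_max, 1000.0 * discriminant(0.85, 0.002, 0.2, 5.0, t_max))
```

With these assumptions, $D$ is negative for $T_{max}$ close to $T_{min}$ and turns positive for
load 0.85 as $T_{max}$ grows, while for load 1.0 it stays slightly below zero.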





Fig. 6. The values of the discriminant of eq. (15) as a function of $T_{max}$ (x-axis, from
10 to 60) for different parameter values of $\beta$, $p_{max}$ and $T_{min}$ (the unit of the
y-axis is $1000 \cdot D$)



6 Conclusions 

In this paper we have studied the performance of a single queue where the RED
algorithm is used to control the access to the queue. The packet stream has
been assumed to be Poisson with a time varying arrival rate, representing the
aggregate traffic generated by a population of TCP sources. The aim has been to
derive a model describing the dynamics of the RED controlled queue when the
arrival rate is an arbitrary function. Hence, the model also makes it possible to
later incorporate the influence of the packet losses due to the RED control, which
cause the TCP source population to adjust its arrival rate. Of primary interest
has been to model the behavior of the expectations of the instantaneous queue
length, $q(t)$, and the exponentially averaged queue length, $s(t)$. This behavior
can be expressed by a pair of ODEs: the one for $s(t)$ is exact, while the ODE
for $q(t)$ requires an additional quasi-stationarity assumption. Numerical results
show that our model captures well the behavior of the corresponding
simulated system under different arrival rate functions. For instance, when using
a step arrival rate function, the oscillations in the expectations, first observed in
simulations, are nicely reproduced.

Then we have shown how to apply familiar control theoretic methods to 
obtain guidelines in choosing the parameters of the system. We analyzed the 
response of the system to a step function and linearized the ODEs around the 
equilibrium point. This enabled us to set requirements on the system parameters 
such that e.g. the system does not exhibit oscillations around the equilibrium 
point. In the infinite buffer case this requirement was expressed in a closed form 
in terms of the system parameters. 

An important extension for future research is to complement the present 
analysis by a dynamic model for the population of TCP sources and to study 
the interaction of these two parts of the system. This is particularly important 
here as RED has been designed to function in cooperation with the TCP sources 
in situations of overload.

Abstract. Random Early Marking (REM) consists of a link algorithm,
that probabilistically marks packets inside the network, and a source
algorithm, that adapts source rate to observed marking. The marking
probability is exponential in a link congestion measure, so that the end-to-end
marking probability is exponential in a path congestion measure.
Marking allows a source to estimate its path congestion measure and
adjust its rate in a way that aligns individual optimality with social
optimality. We describe the REM algorithm, summarize its key properties,
and present some simulation results that demonstrate its stability,
fairness and robustness.



1 Introduction 

Flow control is a distributed algorithm to share network resources among com- 
peting sources. It often consists of two (sub)algorithms: a link algorithm executed 
inside the network at routers or switches, and a source algorithm executed at 
edge devices such as host computers or edge routers. The link algorithm detects 
congestion and feeds back information to sources, and in response, the source 
algorithm adjusts the rate at which traffic is injected into the network. The ba- 
sic design issue is what to feed back (link algorithm) and how to react (source 
algorithm), and the objective is to achieve stability, fairness and robustness. Ide- 
ally one should design the link and source algorithm jointly so that they work 
in concert to steer the network to track a possibly moving desirable operating 
point. This motivates a recent approach to flow control based on optimization 
e.g., [4,8,10,6,12,16,13,1,7,17,18,11,3], where the goal is to choose source rates 
to maximize a global measure of network performance. Flow control, both the 
link and the source algorithms, is derived as a distributed solution to this wel- 
fare maximization problem. Different proposals in the literature differ in their 
objective function, or solution approach, which lead to different link and source 
algorithms and their implementation. Though it may not be possible, nor crit- 
ical, that exact optimality is attained in practice, the optimization framework 
allows us to understand, and control, the behavior of the network as a whole. 

* This work is supported by the Australian Research Council under grants S499705,
A49930405 and S4005343.

Indeed we may regard the sources and links as processors in an asynchronous
distributed computation system and flow control as a computation to maximize
welfare. Under mild conditions on the welfare function, the computation
can be proved to converge, i.e., the flow control algorithm is globally stable.
Moreover, convergence can be maintained even in an asynchronous environment
where sources and links communicate and update at different times, with different
frequencies, using outdated information, and where feedback delays are different
and time-varying [16].

The approach taken in [16] derives the flow control algorithm as a gradi- 
ent projection algorithm to solve the dual of the welfare maximization. It is 
remarkable that major TCP flow control schemes, Vegas, Reno, Reno/RED, 
Reno/REM, can all be interpreted in the same duality framework as carrying 
out variants of this basic algorithm with different source utility functions [15,14]; 
see also [9,11]. The purpose of this paper is to propose, and evaluate through 
simulations, a practical implementation (REM) of the basic algorithm of [16]. 
Like Vegas, the source algorithm of REM that we use in this paper assumes a 
log utility function. The link algorithm of REM probabilistically marks packets, 
like RED. A fundamental difference between REM and the other TCP schemes 
is the way it measures congestion and adjusts the congestion measure, which 
allows REM to achieve high utilization with negligible loss or queueing delay; 
see [2]. 

In Section 2 we summarize the REM algorithm and its key features. In Sec- 
tion 3 we describe our simulation setup. We then present a subset of our sim- 
ulation results on robustness to parameter setting. An extended version of this 
paper is [2]. 

2 Random Early Marking 

REM is a practical implementation, using binary feedback, of a basic flow control
algorithm derived using duality theory to solve an optimal bandwidth allocation
problem; see [16]. Here we summarize the REM algorithm and its key features.

2.1 Algorithm 

For our purposes a network is a set $L$ of links with finite capacities $c_l$, $l \in L$. It
is shared by a set $S$ of sources. A source $s$ traverses a subset $L(s) \subseteq L$ of links
to the destination, and attains a utility $U_s(x_s)$ when it transmits at rate $x_s$
that satisfies $0 \le m_s \le x_s \le M_s < \infty$. REM is defined by the following link
algorithm (1-2) and source algorithm (4-5).

Each link $l$ updates a congestion measure $p_l(t)$ in period $t$ based on the
aggregate input rate $x^l(t)$ and the buffer backlog $b_l(t)$ at link $l$:

$$p_l(t+1) = [\,p_l(t) + \gamma\,(\alpha_l b_l(t) + x^l(t) - c_l)\,]^+ \qquad (1)$$

where $\gamma > 0$ and $\alpha_l > 0$ are small constants and $[z]^+ = \max\{z, 0\}$. Hence $p_l(t)$
is increased when the backlog $b_l(t)$ or the aggregate input rate $x^l(t)$ at link $l$
is large compared with its capacity $c_l$, and is reduced otherwise. Note that
the algorithm does not require per-flow information and works with any work
conserving service discipline at the link. The adjustment rule (1) leads to a small
backlog ($b_l^* \approx 0$) and high utilization ($x^{l*} \approx c_l$) at bottleneck links $l$ in
equilibrium. Link $l$ marks each packet arriving in period $t$, that is not already marked
at an upstream link, with a probability $m_l(t)$ that is exponentially increasing in
the congestion measure $p_l(t)$:

$$m_l(t) = 1 - \phi^{-p_l(t)} \qquad (2)$$

where $\phi > 1$ is a constant. Once a packet is marked, its mark is carried to the
destination and then conveyed back to the source via acknowledgement.
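The link algorithm (1-2) can be sketched in a few lines of Python. This is only an
illustration and not the simulator used for the experiments: the function names are
hypothetical, the defaults for `gamma` and `alpha` are the values quoted in Section 3, and
`phi = 1.1` is one of the values used in Section 4.

```python
import random

def update_price(price, backlog, input_rate, capacity, gamma=0.005, alpha=0.1):
    """One step of rule (1): p <- [p + gamma * (alpha * b + x - c)]^+ ."""
    return max(price + gamma * (alpha * backlog + input_rate - capacity), 0.0)

def marking_probability(price, phi=1.1):
    """Rule (2): m = 1 - phi**(-p), with phi > 1."""
    return 1.0 - phi ** (-price)

def mark_packet(already_marked, price, phi=1.1, rng=random):
    """Keep an upstream mark; otherwise mark with probability m."""
    return already_marked or rng.random() < marking_probability(price, phi)

# Overloaded link: input rate 30 against capacity 25, backlog of 10 packets.
p = 1.0
for _ in range(5):
    p = update_price(p, backlog=10.0, input_rate=30.0, capacity=25.0)
print(p, marking_probability(p))
```

The price rises while the link is overloaded and is clipped at zero when the link is
underused, so an idle link marks no packets.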

The exponential form is critical for a multilink network, because the end-to-end
probability that a packet of source $s$ is marked after traversing a set $L(s)$ of
links is then

$$m^s(t) = 1 - \prod_{l \in L(s)} (1 - m_l(t)) = 1 - \phi^{-p^s(t)} \qquad (3)$$

where $p^s(t) = \sum_{l \in L(s)} p_l(t)$ is the sum of link congestion measures along source
$s$'s path, a path congestion measure. The end-to-end marking probability is high
when $p^s(t)$ is large.
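The step from the product over links to a single exponential can be written out as
follows, assuming that links mark independently of one another and that each link uses
the same base $\phi$:

```latex
\begin{align*}
m^s(t) &= 1 - \prod_{l \in L(s)} \bigl(1 - m_l(t)\bigr)
        = 1 - \prod_{l \in L(s)} \phi^{-p_l(t)} \\
       &= 1 - \phi^{-\sum_{l \in L(s)} p_l(t)}
        = 1 - \phi^{-p^s(t)} .
\end{align*}
```

The product of exponentials collapses into an exponential of the sum, which is why a single
bit per packet is enough to convey the path congestion measure.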

Source $s$ estimates this end-to-end marking probability $m^s(t)$ by the fraction
$\hat m^s(t)$ of its packets marked in period $t$, and estimates the path congestion
measure by inverting (3):

$$\hat p^s(t) = -\log_\phi (1 - \hat m^s(t)) \qquad (4)$$

where $\log_\phi$ is the logarithm to base $\phi$. It then adjusts its rate using marginal utility:

$$x_s(t) = [\,{U_s'}^{-1}(\hat p^s(t))\,]_{m_s}^{M_s} \qquad (5)$$

where ${U_s'}^{-1}$ is the inverse of the marginal utility and $[z]_a^b = \max\{\min\{z, b\}, a\}$.
If $U_s$ is strictly concave, then ${U_s'}^{-1}$ exists and is strictly decreasing. Hence the
source algorithm (5) says: if the path $L(s)$ is congested ($\hat p^s(t)$ is large), transmit
at a small rate, and vice versa.

For example, if $U_s(x_s) = w_s \log x_s$, $w_s > 0$, then $x_s(t) = w_s/\hat p^s(t)$; if
$U_s(x_s) = -(M_s - x_s)^2/(2 w_s)$, $0 \le x_s \le M_s$, then $x_s(t) = M_s - w_s \hat p^s(t)$
if $\hat p^s(t) \le M_s/w_s$ and 0 otherwise.
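A minimal sketch of the source algorithm (4-5) for the log utility case follows. The
function names are hypothetical, and the handling of the corner cases (no packet marked,
every packet marked) is our own choice rather than something specified above.

```python
import math

def estimate_path_price(marked_fraction, phi=1.1):
    """Rule (4): invert (3) to get the path price from the fraction of marked packets."""
    if marked_fraction >= 1.0:
        return math.inf                      # every packet marked: price unbounded
    return -math.log(1.0 - marked_fraction, phi)

def source_rate_log_utility(price_estimate, w, min_rate, max_rate):
    """Rule (5) for U_s(x) = w * log(x): x = w / p, projected onto [min_rate, max_rate]."""
    target = math.inf if price_estimate <= 0.0 else w / price_estimate
    return max(min(target, max_rate), min_rate)

phi = 1.2
m_hat = 1.0 - phi ** (-3.0)                  # marking fraction produced by path price 3
p_hat = estimate_path_price(m_hat, phi)      # recovers approximately 3
print(p_hat, source_rate_log_utility(p_hat, w=12.5, min_rate=0.1, max_rate=50.0))
```

The projection keeps the rate within $[m_s, M_s]$ even when the estimate is degenerate.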

The link marking probability (2) and the source rate (5) are illustrated in 
Figure 1. 

2.2 Smoothed REM 

Fig. 1. (a) Marking probability $m_l = 1 - \phi^{-p_l}$ as a function of $p_l$. (b) Source
rate $x_s = {U_s'}^{-1}(-\log_\phi(1 - m^s))$ as a function of $m^s$. Here, $\phi = 1.2$ and
$U_s(x_s) = 2 \log x_s$

We have found from simulations that a smoothed version of REM performs
better, especially when the end-to-end marking probabilities in the network take
extreme values (close to 0 or 1). In smoothed REM, a source adjusts its window
once every round trip time. For each adjustment, the window is incremented or
decremented by 1 (or by a small fraction, say 10%, of the current window size)
according to whether the target value determined by the price is larger or smaller
than the current window.
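The smoothed update can be sketched as below. We assume here that the target window is the
target rate multiplied by the estimated round trip time, and we add a floor of one packet,
which is our own safeguard; `smoothed_window` is a hypothetical name.

```python
def smoothed_window(window, target_rate, rtt, step=1.0, min_window=1.0):
    """Move the window one step towards target_rate * rtt; call once per round trip."""
    target = target_rate * rtt
    if target > window:
        return window + step
    if target < window:
        return max(window - step, min_window)   # floor is an assumption of this sketch
    return window

w = 10.0
for _ in range(3):
    w = smoothed_window(w, target_rate=2.0, rtt=10.0)   # target window is 20
print(w)
```

To use the proportional variant, pass `step=0.1 * window` instead of the default.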



2.3 Key Features of REM 

Random Exponential Marking (REM) has three advantages. First, it is ideally
suited for networks with multiple congested links, where the end-to-end marking
probability of a packet incorporates a congestion measure of its path. This allows
a source to estimate its path congestion measure by observing the fraction of
its packets that are marked. The use of marking as a means for sources to
estimate information on their paths seems novel and applicable in other contexts.
Second, by equalizing input rate $x^l$ with capacity $c_l$ and driving backlog $b_l$ to
zero, the update rule (1) leads to very high utilization with negligible loss or
queueing delay. Third, as we will see, under REM, the sources and links can be
thought of as carrying out a stochastic approximation algorithm to maximize
the aggregate utility $\sum_s U_s(x_s)$ over the source rates $x_s$, $s \in S$, subject to link
capacity constraints. As alluded to earlier, it is not critical that optimality is
exactly attained in practice. It is however significant that REM attempts to
steer the network as a whole towards a desirable operating point. Moreover this
operating point can be chosen to achieve desired fairness.

We have done extensive simulations to evaluate four properties of REM: 
stability, fairness, scalability and robustness. We now summarize our findings. 

REM can be regarded as a stochastic version of the basic algorithm in [12,16]. 
Though we have not proved analytically the stability and fairness of REM, our 
simulation results confirm that it inherits the stability and fairness properties 
of the basic algorithm. It is proved that the basic algorithm converges to the
unique optimum that maximizes aggregate source utility even in an asynchronous
environment [16, Theorems 1 and 2]. Moreover, the equilibrium can be chosen to
achieve different fairness criteria, such as proportional [8] or maxmin fairness, by 
appropriate choice of source utility functions [16, Theorems 3 and 4]. Simulation 
results in later sections show that REM converges quickly to a neighborhood 
of the equilibrium, and then fluctuates around it. Hence the basic algorithm 
determines the macroscopic behavior of REM, including stability and fairness. 

A focus of our simulation study is to explore the scalability and robustness of 
REM. There are two aspects of scalability: complexity and performance. Both 
the link algorithm (1-2) and the source algorithm (4-5) use only local, and 
aggregate, information. Their complexity does not increase with the number of 
sources or the number of links or their capacities. Moreover they do not need to 
be restarted as network conditions, such as the link capacities, the set of sources, 
their routes or utility functions, change. Hence REM is applicable in a dynamic 
network even though it is derived from a static model. A critical issue however 
is whether performance scales. We present simulation results to demonstrate 
that REM’s performance, such as throughput, utilization, queue length and loss, 
remains stable when traffic load, link capacity, propagation delay, or network size 
is scaled up by a factor of 10. 

We evaluate robustness both with regard to parameter setting and to modeling
assumptions. First, REM is characterized by three main parameters: $\gamma$, which
determines the rate of convergence; $\alpha_l$, which trades off link utilization and
delay; and $\phi$, which affects the marking probability. The scalability experiments also
demonstrate REM's robustness to parameter setting, i.e., its performance remains
stable in an environment that is drastically different from the nominal
environment with respect to which the parameter values have been chosen. Second,
REM estimates round trip time in order to translate rate to window control.
Simulations indicate that REM is robust to error in round trip time estimation.

Due to space limitations we only present a subset of these simulation results
to illustrate the scalability and robustness of REM to parameter setting. The
full set of simulation results, pseudocode and discussions are in [2].

3 Simulation Setup

We list here the network topology, source parameters, and link parameters that 
are common to most simulations. Other details that may vary across simulations 
will be given in the following subsections. 

All simulations are conducted for one of the two networks shown in Figure 2.
The single (bottleneck) link network consists of $n$ sources transmitting to a
common destination. Each source is connected to a router via an access link
and then to the destination via a shared link. In the multilink network, only the
shared links are shown, not the access links. There are $n$ shared links all with the
same capacity, one long connection using all the $n$ links and $n$ short connections
each using a single link as shown in the figure. This network is widely used in
previous studies, e.g., in [5]. In both networks the shared link(s) has (have) a
lower capacity than the access links and is (are) the only bottleneck(s). At each
link packets are served in FIFO order.

Fig. 2. Network topologies: (a) single (bottleneck) link network; (b) multilink network.
In the single-link network, propagation delays vary across simulations. In the
multilink network, each short connection has a round trip propagation delay of 7 ms,
and the long connection 2n + 5 ms

The utility functions of the REM sources are $w_s \log x_s$, where $w_s$ may take
different values in different experiments. The source rate is controlled through
windowing, where the rate calculated at a source is converted to a window size by
multiplying it by the estimated round trip time. The sources are greedy and always
exhaust the window. The destination immediately sends an acknowledgement on
receipt of a packet. The maximum rate $M_s$ for each source is $2 \times c$, where $c$ in
packets/ms is the bottleneck link capacity. The minimum source rate $m_s$ is 0.1
packets/ms.

Our discrete time packet-level simulations are written in MATLAB. Unless
otherwise specified, the main parameter values are $\alpha_l = 0.1$, $\gamma = 0.005$. The
value of $\phi$ varies; see the following sections.

4 Robustness to Parameter Setting 

In this section we present experimental results on scalability and robustness. As 
alluded to earlier, REM as defined by (1-2) and (4-5), involves only local and 
aggregate information, and hence its complexity scales. A critical issue however 
is whether its performance remains stable as we scale up the number of sources, 
the link capacity, the propagation delay, or the network size. We present four 
experiments to demonstrate that it does. These experiments also show that REM 
performs well across a wide range of network conditions. This makes tuning of 
its parameters easier. 

4.1 Traffic Load 

This set of 10 experiments shows that REM copes well as traffic load increases.
Each experiment uses the single-link network of Figure 2(a) with $n$ sources,
$n = 10, 20, \ldots, 100$. All sources have the same utility function $U_s(x_s) = 12.5 \log x_s$
and round trip propagation delay of 10 ms. The bottleneck link has a capacity
of 25 packets/ms and a finite buffer of size 50 packets. The equilibrium price
for the experiment with $n$ sources is thus $p^*(n) = n/2$. For all the 10
experiments, the REM parameters are: $\gamma = 0.001$, $\alpha_l = 0.1$, $\phi = 1.05$.
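The value of $p^*(n)$ follows from the log utility if one assumes that the bottleneck is
fully utilized in equilibrium and that all $n$ sources see the same price. Writing
$w = 12.5$ for the common utility weight and $c = 25$ packets/ms for the capacity:

```latex
\begin{align*}
x_s^* = \frac{w}{p^*(n)}, \qquad n\,x_s^* = c
\;\Longrightarrow\;
p^*(n) = \frac{n\,w}{c} = \frac{12.5\,n}{25} = \frac{n}{2},
\qquad x_s^* = \frac{25}{n}\ \text{packets/ms}.
\end{align*}
```

For example, with $n = 50$ sources the price is 25 and each source sends at 0.5 packets/ms.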

The results are shown in Figure 3. The equilibrium source rate, averaged over
all sources, decreases steadily as the number of sources increases and matches
well the theoretical value. The equilibrium link utilization remains above 96%
while the equilibrium loss (< 0.2%) and backlog (< 10 packets) remain low.

4.2 Capacity 

This set of 10 experiments is similar to the previous set, except that the number
of sources is fixed at 20 but the link capacity is increased from 10 to 100 pkts/ms
in 10 pkts/ms increments. The round trip propagation delay is 10 ms and the
buffer size is 40 pkts. REM parameters for all 10 experiments are: $\gamma = 0.005$,
$\alpha_l = 0.1$, $\phi = 1.1$.

The results are shown in Figure 4. The equilibrium source rate, averaged
over all sources, increases linearly as link capacity increases and matches well
the theoretical value. The equilibrium link utilization remains above 96% while
the equilibrium loss (< 1%) and backlog (< 14 packets) remain low.

Fig. 3. Scalability with traffic load: (a) source rate; (b) link utilization; (c) loss
and backlog. In (a) each bar represents the measured value and each star represents
the theoretical value

Fig. 4. Scalability with link capacity: (a) source rate; (b) link utilization; (c) loss
and backlog. In (a) each bar represents the measured value and each star represents
the theoretical value

4.3 Propagation Delay

This set of 10 experiments is similar to the previous set, except that the link
capacity is fixed at 20 pkts/ms but the round trip propagation delay is increased
from 10 to 100 ms in 10 ms increments. For all 10 experiments, the buffer size is
120 pkts and the REM parameters are: $\gamma = 0.001$, $\alpha_l = 0.1$, $\phi = 1.1$.

The results are shown in Figure 5. The equilibrium source rate, averaged over
all sources, remains steady as propagation delay increases and matches well the
theoretical value. The equilibrium link utilization remains above 94% while the
equilibrium loss (< 0.2%) and backlog (< 13 packets) remain low.

Fig. 5. Scalability with propagation delay: (a) source rate; (b) link utilization;
(c) loss and backlog; the x-axis is the propagation delay (ms). In (a) each bar
represents the measured value and each star represents the theoretical value

4.4 Network Size 

A large network presents two difficulties. First, it necessitates a small $\gamma > 0$ in
price adjustment, which leads to slower convergence. Second, it makes price
estimation more difficult, which often leads to wild oscillation and poor utilization.
The second difficulty is exposed most sharply in the multilink network of
Figure 2(b). When the short connections all have the same utility functions, the long
connection sees a price that is $n$ times what a short connection sees. It hence
sees an end-to-end marking probability that is much larger than what the short
connections see. Extreme marking probabilities (when $n$ is large) can lead to severe
oscillation in the buffer occupancies. The next set of 10 experiments shows that
a small $\alpha_l$ (= 0.1) reduces the effect of buffer oscillation on prices. This produces
smoother price and window processes and a better utilization, improving the
scalability of REM with network size.

In the simulation, utility functions are $U_s(x_s) = w_s \log x_s$, with $w_0$ for the
long connection set to $n$ times $w_1 = \cdots = w_n$ for the short connections, when
there are $n$ links in the network. This is to achieve maxmin fairness, according
to which the long connection should receive 50% of bandwidth for all network
sizes. We measure both the throughput share of the long connection and link
utilization. Throughput share is $x_0^*/(x_0^* + x_i^*)$, $i = 1, \ldots, n$, where $x_i^*$ are the
equilibrium rates of connections $i$, $i = 0, 1, \ldots, n$. Link utilization is $x_0^* + x_i^*$
at link $i$. The results are shown in Figure 6 for network sizes above 5. The
throughput share matches very well the theoretical value and the link utilization
is very high. More importantly, the performance remains stable as network size
increases.

Fig. 6. Scalability with network size: (a) throughput share of long connection;
(b) link utilization. In (a) each bar represents the measured
throughput share at each of the $n$ links and each star represents the theoretical
value. In (b) each bar represents the measured utilization at each of the $n$ links
and the straight line represents 100% utilization

Figure 7 shows the result of another simulation with 20 links. Sources have
identical utility parameters to achieve proportional fairness. The throughput
shares of the long connection at each of the 20 links are shown in Figure 7(a). The
window processes for the 21 connections are shown in Figure 7(b). The performance
is very close to expected.
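The two theoretical shares quoted in this section can be checked by hand. The sketch below
assumes a symmetric equilibrium in which every one of the $n$ links has the same price
$p^*$, so that the long connection sees the path price $n p^*$, and writes $w$ for the common
weight of the short connections:

```latex
\begin{align*}
\text{maxmin weights } (w_0 = n w):\quad
  & x_0^* = \frac{n w}{n p^*} = \frac{w}{p^*} = x_i^*
  \;\Longrightarrow\; \frac{x_0^*}{x_0^* + x_i^*} = \frac{1}{2}, \\
\text{identical weights } (w_0 = w):\quad
  & x_0^* = \frac{w}{n p^*} = \frac{x_i^*}{n}
  \;\Longrightarrow\; \frac{x_0^*}{x_0^* + x_i^*} = \frac{1}{n+1} = \frac{1}{21}
  \ \text{ for } n = 20 .
\end{align*}
```

The first line gives the 50% share targeted by the maxmin experiments, and the second the
proportionally fair share of the long connection over 20 links.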

5 Conclusion 

We have presented a REM algorithm for flow control as a practical implementation
of the basic algorithm of [16]. Extensive simulations indicate that it is
stable, fair, and robust; it achieves high utilization with negligible loss or queueing
delay. REM owes its robustness and good performance fundamentally to the
way it measures congestion; see [2].

Fig. 7. Scalability with network size: (a) throughput share of long connection;
(b) window size. In (a) the straight line shows the theoretical
share of 1/21 for the long connection. In (b) the lower curve is the window
process for the long connection, and the upper ones are those for the 20 short
connections. Marking probability varies over [0.1972, 0.9876]

Abstract. Many schemes have been proposed to support TCP traffic in
a Differentiated Services network. We present in this paper an analytical
model to study the performance of these schemes. The model is based
on a Markovian fluid approach. We first provide a general version of the
model, then we specialize it to the different proposed schemes. For each
scheme, we calculate the throughput achieved by a TCP connection.
We then compare their service differentiation capacity under different
subscription levels, different reservations, and different round-trip times.



1 Introduction 

There has been increasing interest in recent years in enhancing the best effort
service of the Internet to provide new applications with some guarantees in terms
of bandwidth, losses, and end-to-end delay. The Differentiated Services architecture
(DiffServ) is considered the most promising approach in this field for reasons
of scalability and incremental deployment [3,14]. Flows are monitored at the
edge of the network. Their parameters (rate, burst size) are compared to the
contract signed between the user and the service provider. Compliant packets
are marked with a high priority. Non-compliant packets are shaped, rejected,
or injected into the network with a low priority. In the core of the network,
priority packets are privileged over non-priority ones. This privilege can be in
the form of better scheduling (e.g., priority scheduling) as in the Premium
service architecture [11,14], or in the form of a lower drop probability as in the
Assured service architecture [3,8]. The main advantage of the DiffServ framework
is that packets in the network are treated as a function of their priority, not as a
function of the flow they belong to. This makes the framework scalable,
flexible, and easy to introduce into the Internet.

The utility of such a framework to applications using UDP, the best effort
transport protocol of the Internet, is evident. An example of such applications is a
real time video or audio communication tool. If the network is well dimensioned,
these applications are able to realize the throughput they desire. The problem
appears with applications using TCP [10], the connection-oriented transport
protocol of the Internet. An application using TCP may ask the network for a better
service (i.e., more throughput) by reserving a certain bandwidth. If at the edge
of the network non-compliant packets are rejected, TCP will reduce its rate when
it reaches the reserved bandwidth. The application fails in this case to use the
bandwidth it is paying for as well as any unreserved bandwidth in the network.
The solution to this problem is to let non-compliant packets get into the network
as low priority packets. This improves TCP performance since the rate can now
reach larger values. However, TCP is not aware of the reservation. The loss of a low
priority packet is not distinguished from the loss of a high priority packet and
the rate is reduced in the same manner. This has been shown to result in
unfairness in the distribution of network resources [2,3,5,6,17,18]. The main result
of these studies is that TCP is unable to realize its target throughput in a DiffServ
network. The target throughput is defined as the reserved bandwidth plus
a fair share of any unreserved bandwidth. A connection with a small reservation
has been shown to achieve better performance than a connection with a large
reservation. These works also show the well known problem of TCP unfairness
in the presence of different round-trip times (RTT). A connection with a small RTT
achieves better performance than a connection with a long RTT. Some solutions
have been proposed to alleviate these problems. They consist in either changing
TCP sources, or marking TCP flows differently at the edge of the network, or
changing the behavior of network routers. The performance of these solutions
has often been evaluated via simulations [3,5,6,17]. In [18], a mathematical
model has been proposed to calculate the throughput of a connection as a function
of the drop probability of its packets. Three schemes have been compared.
However, this model is not able to study the impact of the parameters of the other
connections (e.g., RTT, reserved bandwidth). Also, it makes the simplistic
assumption that the TCP window varies in a cyclic manner with all the cycles having
the same duration. Our experiments over the Internet have shown that this
assumption is not so realistic [1].

* An extended version of this paper is available upon request from the authors.

J. Crowcroft, J. Roberts, and M. Smirnov (Eds.): QofIS 2000, LNCS 1922, pp. 55-67, 2000.
© Springer-Verlag Berlin Heidelberg 2000

Chadi Barakat and Eitan Altman

In this paper we present a general Markovian model able to a) calculate the
performance of all the connections sharing a bottleneck, and b) account for the
different solutions already proposed, or to be proposed. Using this model, we compare
the performance of some schemes proposed to support TCP in a DiffServ network.
In the next section we outline the different schemes we are considering
in this paper. In Section 3 we explain our model. In Section 4 we calculate the
throughput of a TCP connection as a function of two parameters. By appropriately
setting these two parameters, we are able to specialize our model to any one
of the proposed schemes. Section 5 explains how these two parameters must be
set. In Section 6 we present some numerical results.

2 TCP in a DiffServ Network 

The main objectives of a DiffServ scheme supporting TCP traffic are:

- The available bandwidth must be efficiently utilized.

- In the case where the sum of the reservations is less than the total throughput
(the under-subscription case), each connection must realize its reservation.
The difference between the total throughput and the total reservation must
be shared equally between the different connections.

- In the case where the sum of the reservations is greater than the total
throughput (the over-subscription case), the total throughput must be
distributed between the different connections proportionally to their
reservations.
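These objectives define a target throughput for each connection, which can be sketched as
follows. The function name `target_throughputs` is hypothetical, and we take "total
throughput" to mean the bandwidth available at the bottleneck, which is our reading rather
than a definition given above.

```python
def target_throughputs(reservations, total_throughput):
    """Target throughput of each connection under the objectives listed above."""
    total_reserved = sum(reservations)
    if total_reserved <= total_throughput:
        # Under-subscription: each reservation plus an equal share of the excess.
        extra = (total_throughput - total_reserved) / len(reservations)
        return [r + extra for r in reservations]
    # Over-subscription: split the throughput in proportion to the reservations.
    return [total_throughput * r / total_reserved for r in reservations]

print(target_throughputs([1.0, 3.0], 10.0))   # under-subscribed
print(target_throughputs([2.0, 6.0], 4.0))    # over-subscribed
```

In the under-subscribed example both connections receive the same 3 units of excess, while in
the over-subscribed one the 1:3 ratio of the reservations is preserved.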

The original proposition to support TCP in a DiffServ network is due to
Clark [3]. Two priority levels have been proposed for packet marking. A packet
that arrives at the edge of the network while the rate of its connection is
smaller than the reservation is marked with a high priority and is called an IN
packet; otherwise it is marked with a low priority and is called an OUT packet.
The service differentiation comes from the different probabilities with which
network routers drop these two types of packets at the onset of congestion. A
variant of RED (Random Early Detection) buffers [7] is proposed to implement
this difference in the drop probability. This variant, called RIO (RED IN/OUT)
[3], has two minimum thresholds instead of one. When the average queue length
exceeds the lower minimum threshold, OUT packets are probabilistically dropped
in order to signal congestion to TCP sources. The buffer starts to drop IN
packets probabilistically when the average queue length (sometimes the average
number of IN packets in the buffer) exceeds the upper minimum threshold. This
scheme, however, has difficulty satisfying the above objectives. Due to the saw
tooth window variation of TCP (Figure 1), a connection is obliged to transmit a
certain amount of OUT packets in order to realize its reservation. Since OUT
packets are very likely to be dropped, the reservation may not be realized.
Moreover, a connection with a large reservation is obliged to transmit more OUT
packets than a connection with a small reservation. Also, a connection with a
large reservation has in general a larger window than a connection with a small
reservation, which makes it more affected by the loss of an OUT packet. This
explains the bias of the scheme proposed in [3] against connections with large
reservations.
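The marking and dropping behavior described above can be sketched in a few
lines. This is only an illustration, not the implementation of [3]: the
function names, the linear RED ramp between a minimum and a maximum threshold,
and every numeric threshold are assumptions chosen for the example.

```python
def red_drop_probability(avg_queue, min_th, max_th, max_p):
    """RED-style ramp: 0 below min_th, linear up to max_p, 1 from max_th on."""
    if avg_queue < min_th:
        return 0.0
    if avg_queue >= max_th:
        return 1.0
    return max_p * (avg_queue - min_th) / (max_th - min_th)


def rio_drop_probability(avg_queue, is_in_packet):
    """RIO: OUT packets use the lower minimum threshold, IN packets the upper.

    All threshold and max_p values are hypothetical.
    """
    if is_in_packet:
        return red_drop_probability(avg_queue, min_th=40, max_th=70, max_p=0.02)
    return red_drop_probability(avg_queue, min_th=10, max_th=40, max_p=0.10)


def mark_packet(measured_rate, reservation):
    """Edge marker: IN while the connection rate is below its reservation."""
    return "IN" if measured_rate < reservation else "OUT"
```

With these example thresholds, an average queue of 25 packets already drops
OUT packets with a small probability while IN packets are still untouched.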

The first and most intuitive solution to this problem is to change TCP so
that the source reduces its window differently depending on whether an OUT or
an IN packet is lost [5,17]. The source is supposed to know the priority of the
lost packet. It is also supposed to know the bandwidth reserved by the
connection. The loss of an IN packet is an indication that the congestion
window must be divided by two as in standard TCP. The loss of an OUT packet is
an indication that the unreserved bandwidth in the network is congested. The
source then reduces its window by half the number of OUT packets it has in the
network. The main problem with this solution is that it requires a change at
the source and knowledge of the priority of the lost packet, both of which are
difficult to implement.

The other solutions try to help connections with large reservations to send
more OUT packets than connections with small reservations, without changing
the TCP algorithms. The first solution, proposed in [3], is based on the saw
tooth variation of the TCP window. To be able to realize its reservation, a TCP
connection must be protected from the other connections until it reaches 4/3
of its reservation (see Figure 1). The idea [3] is to change the marker so that
it marks packets as OUT only when the rate of the connection exceeds 4/3 of the
reserved bandwidth. We call this proposition the Saw Tooth Marking scheme.
Its drawback is that, during some periods, it injects more IN packets into the
network than were promised.

The second solution [17] proposes to drop OUT packets in network routers 
according to the reserved bandwidth. The authors [17] show that dropping OUT 
packets with a probability inversely proportional to the bandwidth reserved by 
the connection improves the performance. We call this solution the Inverse Drop 
Probability scheme. Its main drawback is that it requires that network routers 
know the bandwidth reserved by every connection. 

The last scheme we consider is the one that proposes to mark the packets with 
three priority levels instead of two [8,9,17]. A RED buffer with three thresholds 
is used. The idea is to protect the OUT packets of a connection transmitting at 
less than its fair share from the OUT packets of another connection exceeding 
its fair share, by giving them some medium priority while giving low priority to 
those of the latter connection. The difference from Saw Tooth Marking is that 
we are not injecting into the network more high priority packets than what we 
promised. 



3 The Markovian Fluid Model 

Consider N long-lived TCP connections sharing a bottleneck of bandwidth $\mu$.
Let $X_i(t)$ be the transmission rate of connection i at time t. It is equal to
the window size divided by the RTT of the connection. The connections increase
their rates until the network gets congested. The congested router starts to drop
packets in order to signal the congestion to TCP sources. A source receiving
a congestion signal (i.e., by detecting the loss of packets) reduces its rate, then
it resumes increasing it. Let $t_n$ denote the time at which the nth congestion
event occurs and let $D_n = t_{n+1} - t_n$. Denote by $X_n^i$ the transmission
rate of connection i at time $t_n$ and by $X_{n+}^i$ its transmission rate just
after $t_n$ (after the disappearance of the congestion). $X_{n+}^i$ is equal to
$X_n^i$ if connection i did not reduce its rate at $t_n$ and to $R_i(X_n^i)$
otherwise. $R_i(X_n^i)$ is a function of $X_n^i$, usually equal to $X_n^i/2$ [10].
We now introduce some assumptions:

Assumption 1: We assume first that queueing times in network nodes are small
compared to the propagation delay. This holds with the active buffer management
techniques (e.g., RED [7]) that keep the queue length at small values.
The RTT of a connection, say i, is then approximately constant and equal to
$T_i$, and the rate of the connection can be considered to increase linearly
with time between congestion events (by one packet every $bT_i$, with b equal
to the number of data packets covered by an ACK). We can write
$X_{n+1}^i = X_{n+}^i + \alpha_i D_n$, where $\alpha_i$ is the slope of this
linear increase. This is true if the slow start phases and timeout events are
rare, which is the case for new versions of TCP with long transfers [4]. We can
also consider that the congestion appears when the sum of the transmission
rates reaches $\mu$. Thus, instants $t_n$ are given by
$\sum_{i=1}^{N} X_n^i = \mu$.

Assumption 2: The second assumption we make is that only one connection
reduces its rate upon a congestion and that the probability that a connection
reduces its rate is a function of its rate and the rates of the other connections
at the moment of congestion. This is again the aim of the new buffer management
techniques (e.g., RED [7]) that implement random drop in order to send
congestion signals to connections consuming more than their fair share of the
bottleneck bandwidth while protecting connections consuming less than their
fair share [7]. We assume that the reaction of the connection receiving the first
congestion signal is quick, so that the congestion in the network disappears
before packets from other connections are dropped.

Let $U_n^i$ be a random variable equal to 1 if source i reduces its rate at
time $t_n$ and to 0 otherwise. Using Assumption 2, we always have
$\sum_{i=1}^{N} U_n^i = 1$. The probability that $U_n^i$ is equal to one is a
function of the connection rates at time $t_n$. Let
$p_i(X_n^1, X_n^2, \ldots, X_n^N)$ denote this probability. It represents the
probability that the dropped packet upon congestion belongs to connection i.
This probability together with $R_i(X_n^i)$ form the two parameters that need
to be appropriately chosen in order to cover all the proposed schemes.

Theorem 1. The process $\{X_n^1, X_n^2, \ldots, X_n^N\}$ can be described as a
homogeneous Markov process of dimension N - 1.

Proof. The transmission rates at time $t_n$ are related by
$\sum_{i=1}^{N} X_n^i = \mu$. Thus, the problem can be analyzed by considering
only the rates of N - 1 connections. In the particular case of N = 2 we get a
one-dimensional problem. Concerning the Markovian property of the model, it is
easy to show that the state of the process at time $t_{n+1}$ depends only on
its state at time $t_n$. Indeed, for any i we have,

$$X_{n+1}^i = X_n^i + U_n^i \left(R_i(X_n^i) - X_n^i\right) + \alpha_i D_n. \qquad (1)$$

Summing over all the i, we get,

$$D_n = \frac{\sum_{i=1}^{N} U_n^i \left(X_n^i - R_i(X_n^i)\right)}{\sum_{i=1}^{N} \alpha_i}. \qquad (2)$$

Given that $R_i(X_n^i)$ and the value taken by $U_n^i$ are only a function of
the process state at time $t_n$, $X_{n+1}^i$ is only a function of the process
state at time $t_n$ and the Markovian property holds. The process is
homogeneous since its state at time $t_{n+1}$ depends only on its state at time
$t_n$ and not on n. □
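The summation step in the proof can be spelled out as follows. The notation is
the one of the model: $X_n^i$ is the rate of connection i at the nth congestion
event, $U_n^i$ indicates whether it reduces its rate, $R_i$ is its reduction
function, $\alpha_i$ the slope of its linear increase, and $D_n$ the time to
the next congestion event. The only fact used is that the rates sum to $\mu$ at
every congestion event.

```latex
\sum_{i=1}^{N} X_{n+1}^i
  = \sum_{i=1}^{N} X_n^i
  + \sum_{i=1}^{N} U_n^i \bigl(R_i(X_n^i) - X_n^i\bigr)
  + D_n \sum_{i=1}^{N} \alpha_i ,
\qquad
\sum_{i=1}^{N} X_{n+1}^i = \sum_{i=1}^{N} X_n^i = \mu
\;\Longrightarrow\;
D_n = \frac{\sum_{i=1}^{N} U_n^i \bigl(X_n^i - R_i(X_n^i)\bigr)}
           {\sum_{i=1}^{N} \alpha_i}.
```

For instance, with N = 2, standard TCP and source 1 reducing its rate, this
gives $D_n = (X_n^1/2)/(\alpha_1 + \alpha_2)$.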



The transmission rate of a connection takes a finite number of values between
0 and $\mu$. This number depends on the RTT of the connection and the packet
size. Denote by $\mathcal{X}$ the state space of our chain. For each state
$X = (x_1, \ldots, x_N) \in \mathcal{X}$, the chain can jump to N different
states at the next congestion event. This depends on which source reduces its
rate. Denote by $F^i(X) = (f_1^i, \ldots, f_N^i)$ the next state when the
system is in state X and source i reduces its rate. Using (1),

$$f_j^i = \begin{cases}
x_j + (x_i - R_i(x_i))\,\alpha_j / \sum_{m=1}^{N} \alpha_m & \text{if } j \neq i \\
R_i(x_i) + (x_i - R_i(x_i))\,\alpha_i / \sum_{m=1}^{N} \alpha_m & \text{if } j = i
\end{cases}$$

Let $P = (P_{XY})_{X,Y \in \mathcal{X}}$ denote the transition matrix of the
chain. We have,

$$P_{XY} = \begin{cases}
p_i(X) & \text{if there is an } i \text{ such that } Y = F^i(X) \\
0 & \text{otherwise}
\end{cases}$$

Denote by $\Pi = (\pi_X)_{X \in \mathcal{X}}$ the stationary distribution of
our chain. We solved the system numerically for different scenarios and we
found that $\Pi$ always exists and is unique. Let $X^i$, $U^i$, and $D$
represent the processes $X_n^i$, $U_n^i$, and $D_n$ in the unique stationary
regime.
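The jump from one congestion event to the next can be sketched as below. This
is a minimal illustration under stated assumptions: the names `next_state` and
`halve` are hypothetical, rates are kept as plain floats rather than being
restricted to the finite state space, and the slopes are arbitrary example
values.

```python
def next_state(x, i, alpha, reduce_fn):
    """State at the next congestion event when source i reduces its rate.

    x         -- rates at the current congestion event (they sum to mu)
    i         -- index of the source that reduces its rate
    alpha     -- slopes of the linear rate increase, one per connection
    reduce_fn -- reduction function R_i applied to the rate of source i
    """
    reduced = reduce_fn(x[i])
    d = (x[i] - reduced) / sum(alpha)      # time to the next congestion event
    nxt = [xj + aj * d for xj, aj in zip(x, alpha)]
    nxt[i] = reduced + alpha[i] * d        # source i restarts from R_i(x_i)
    return nxt, d


def halve(rate):
    """Standard TCP reaction: the rate is divided by two."""
    return rate / 2.0
```

For two connections with equal slopes, `next_state([1000.0, 500.0], 0, [1.0,
1.0], halve)` moves the chain to a state whose rates again sum to 1500, which
is the bottleneck constraint of the model.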



4 Calculation of the Throughput 



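The throughput of a connection, say i (or the time average of its transmission
rate), is equal to

$$\bar{X}^i = \lim_{t \to \infty} \frac{1}{t} \int_0^t X_i(u)\,du
= \lim_{n \to \infty} \frac{\sum_{m=0}^{n-1} \int_{t_m}^{t_{m+1}} X_i(u)\,du}{\sum_{m=0}^{n-1} D_m}
= \lim_{n \to \infty} \frac{\frac{1}{n} \sum_{m=0}^{n-1} \left(X_{m+}^i D_m + \alpha_i D_m^2/2\right)}{\frac{1}{n} \sum_{m=0}^{n-1} D_m}
= \frac{E\left[\left(X^i + U^i (R_i(X^i) - X^i)\right) D + \alpha_i D^2/2\right]}{E[D]}. \qquad (3)$$

Let $D^j(X)$ denote the time until the next congestion event when the system
is in state $X \in \mathcal{X}$ and source j reduces its rate. Using (2) we
have, $D^j(X) = (x_j - R_j(x_j)) / \sum_{m=1}^{N} \alpha_m$. Thus,

$$\bar{X}^i = \frac{\sum_{X \in \mathcal{X}} \pi_X \left( \sum_{j=1}^{N} p_j(X) \left( x_i D^j(X) + \alpha_i (D^j(X))^2/2 \right) + p_i(X) (R_i(x_i) - x_i) D^i(X) \right)}{\sum_{X \in \mathcal{X}} \pi_X \sum_{j=1}^{N} p_j(X) D^j(X)}. \qquad (4)$$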
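As a cross-check on the closed-form expression, the time average can also be
estimated by simulating the chain of congestion events directly. The sketch
below is an assumption-laden illustration, not the numerical method of the
paper: it uses continuous rates instead of the finite state space, a
hypothetical function name, and a pseudo-random generator with a fixed seed.

```python
import random


def estimate_throughput(mu, alpha, pick_prob, reduce_fn, events=20000, seed=0):
    """Monte Carlo estimate of the time-average rate of each connection.

    mu        -- bottleneck bandwidth
    alpha     -- slopes of the linear rate increase, one per connection
    pick_prob -- function returning the probabilities p_i(X) for a state X
    reduce_fn -- function R(i, x_i): rate of source i after it backs off
    """
    rng = random.Random(seed)
    n = len(alpha)
    x = [mu / n] * n                       # arbitrary starting state
    area = [0.0] * n                       # integral of each rate over time
    elapsed = 0.0
    for _ in range(events):
        i = rng.choices(range(n), weights=pick_prob(x))[0]
        after = list(x)
        after[i] = reduce_fn(i, x[i])      # rates just after the congestion
        d = (x[i] - after[i]) / sum(alpha)
        for j in range(n):
            # Area under a linear ramp: start * d + alpha_j * d^2 / 2.
            area[j] += after[j] * d + alpha[j] * d * d / 2.0
            x[j] = after[j] + alpha[j] * d
        elapsed += d
    return [a / elapsed for a in area]
```

In a best effort setting (drop probability proportional to the rate, standard
halving) two identical connections should obtain nearly equal estimates, with
a total below the bottleneck bandwidth because of the saw tooth.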



5 Application of the Model to Real Schemes 

Suppose that the system is in state $X = (x_1, \ldots, x_N) \in \mathcal{X}$ in
the stationary regime. In order to calculate the throughput, we find in this
section the expressions of the two functions $p_i(X)$ and $R_i(x_i)$ for every
scheme.

5.1 Standard TCP with RIO 

A source i asks the network for a bandwidth $\mu_i$. When the congestion
appears at the bottleneck, the router starts to drop OUT packets with a
certain probability. If there are no OUT packets in the network, congestion
remains and the router starts to drop IN packets probabilistically. When an OUT
or IN packet is dropped, the corresponding connection divides its rate by two.
Thus, $R_i(x_i) = x_i/2$ in this case and in all the subsequent cases where
standard TCP is used.

The probability that a connection reduces its rate upon a congestion is equal 
to 0 when it is transmitting only IN packets and there is at least one connection 
transmitting OUT packets. It is equal to 1 if it is the sole connection transmitting 
OUT packets. Now, we study the case where the connection is transmitting 
OUT packets together with other connections. The other case, that of all the 
connections transmitting only IN packets, will be directly deduced. 

Let q be the probability that an OUT packet is dropped at the bottleneck
upon congestion. Let V be the result of the probabilistic drop applied to a
packet. It is equal to 1 if the packet is really dropped and to zero otherwise.
Denote by Y the number of the connection to which the dropped OUT packet
belongs. In the following we denote by $P_X(A)$ the probability that event A
happens given that the system is in state $X \in \mathcal{X}$ upon congestion.
We have,

$$p_i(X) = P_X(Y = i \mid V = 1) = \frac{P_X(Y = i)\,P_X(V = 1 \mid Y = i)}{\sum_{m=1}^{N} P_X(Y = m)\,P_X(V = 1 \mid Y = m)}.$$

$P_X(V = 1 \mid Y = m)$ is none other than q. Thus, $p_i(X)$ is equal to
$P_X(Y = i)$, which is equal to the ratio of the rate at which connection i is
sending OUT packets and the total rate at which OUT packets are sent. Thus,

$$p_i(X) = P_X(Y = i) = \frac{(x_i - \mu_i)\,1\{x_i > \mu_i\}}{\sum_{m=1}^{N} (x_m - \mu_m)\,1\{x_m > \mu_m\}},$$

where $1\{\cdot\}$ is the indicator function.

Similarly, when all the connections are transmitting only IN packets, $p_i(X)$
is equal to $P_X(Y = i)$, which is equal to the ratio of the rate at which
connection i is sending IN packets and the total rate at which IN packets are
sent. It follows,

$$p_i(X) = \begin{cases}
x_i/\mu & \text{if } \sum_{m=1}^{N} 1\{x_m > \mu_m\} = 0 \\
\dfrac{(x_i - \mu_i)\,1\{x_i > \mu_i\}}{\sum_{m=1}^{N} (x_m - \mu_m)\,1\{x_m > \mu_m\}} & \text{otherwise}
\end{cases}$$
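The probability that each source is the one to back off under standard RIO can
be computed as in the sketch below. The function name `p_rio` and the example
rates are hypothetical; the logic follows the two cases described above (OUT
packets present, or only IN packets in the network).

```python
def p_rio(x, reservations):
    """Probability that each source reduces its rate at a congestion event.

    x            -- rates of the connections at the congestion event
    reservations -- reserved bandwidth of each connection
    """
    out_rates = [max(xi - mi, 0.0) for xi, mi in zip(x, reservations)]
    total_out = sum(out_rates)
    if total_out > 0:
        # Some OUT packets exist: the dropped packet is one of them.
        return [r / total_out for r in out_rates]
    # Only IN packets in the network: drop in proportion to the rates.
    total = sum(x)
    return [xi / total for xi in x]
```

For example, with rates (900, 600) and reservations (500, 500) the OUT rates
are 400 and 100, so the first source is four times more likely to back off.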



5.2 Modified TCP with RIO 

$p_i(X)$ is the same as in the previous section. The difference is in the
function $R_i(x_i)$. If an IN packet is lost, the source divides its rate by
two as with standard TCP. If the dropped packet is an OUT packet, only the rate
of OUT packets is divided by two. We consider in our model that the dropped
packet from connection i is an IN packet if at the moment of congestion source
i is transmitting at less than its reservation, otherwise it is an OUT packet.
Thus,

$$R_i(x_i) = \begin{cases}
x_i/2 & \text{if } x_i < \mu_i \\
\mu_i + (x_i - \mu_i)/2 & \text{otherwise}
\end{cases}$$
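The two reduction rules can be compared side by side with a short sketch. The
function names are hypothetical; the arithmetic is the one given in the text
(halve the whole rate, or halve only the part above the reservation).

```python
def reduce_standard(rate):
    """Standard TCP: the whole rate is divided by two."""
    return rate / 2.0


def reduce_modified(rate, reservation):
    """Modified TCP: only the OUT part of the rate is divided by two."""
    if rate < reservation:
        return rate / 2.0                  # the lost packet is taken to be IN
    return reservation + (rate - reservation) / 2.0   # lost packet is OUT
```

A connection sending at 900 with a reservation of 500 falls to 450 with
standard TCP but only to 700 with the modified rule, so it stays above its
reservation after the loss.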

5.3 Inverse Drop Probability Scheme

Standard TCP is used, therefore $R_i(x_i)$ is equal to $x_i/2$. The difference
in this case is that packets of different connections (IN or OUT) are not
treated in the same manner in the core of the network. The idea proposed in
[17] is to drop OUT packets from a connection with a probability that varies as
the inverse of its reservation. However, the drop probability of IN packets is
not specified. IN packets are actually dropped when all the connections are
transmitting at less than their reservations. In this case and according to our
objectives (Section 2), the throughput of a connection must be proportional to
its reservation. It is known that the throughput of a TCP connection varies as
the inverse of the square root of the packet drop probability [1,13,15]. Thus,
we propose to drop IN packets with a probability that varies as the inverse of
the square of the reservation. We add this new feature to the proposed scheme.

As in the case of RIO with standard TCP, a connection reduces its rate
with probability 1 if it is the sole connection exceeding its reservation and
with probability 0 if it is transmitting only IN packets and there is at least
one connection transmitting OUT packets. For the remaining two cases, we
consider first the case when the connection is transmitting OUT packets
together with other connections. The other case will be directly deduced.

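Suppose that the bottleneck router drops OUT packets of source m with a
probability $q/\mu_m$, where q is a constant. Then,

$$p_i(X) = \frac{P_X(Y = i)\,P_X(V = 1 \mid Y = i)}{\sum_{m=1}^{N} P_X(Y = m)\,P_X(V = 1 \mid Y = m)}
= \frac{P_X(Y = i)/\mu_i}{\sum_{m=1}^{N} P_X(Y = m)/\mu_m}
= \frac{x_i/\mu_i - 1}{\sum_{m=1}^{N} (x_m/\mu_m - 1)\,1\{x_m > \mu_m\}}.$$

The general expression of $p_i(X)$ for this scheme is given by

$$p_i(X) = \begin{cases}
\dfrac{x_i/\mu_i^2}{\sum_{m=1}^{N} x_m/\mu_m^2} & \text{if } \sum_{m=1}^{N} 1\{x_m > \mu_m\} = 0 \\
\dfrac{(x_i/\mu_i - 1)\,1\{x_i > \mu_i\}}{\sum_{m=1}^{N} (x_m/\mu_m - 1)\,1\{x_m > \mu_m\}} & \text{otherwise}
\end{cases}$$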
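A sketch of the back-off probabilities for this scheme follows. The function
name is hypothetical. The OUT branch weights each connection by its OUT rate
divided by its reservation; the IN-only branch is an assumption derived from
the text above (IN packets dropped with a probability varying as the inverse
of the square of the reservation), so each connection is weighted by its rate
divided by the squared reservation. Reservations must be strictly positive.

```python
def p_inverse_drop(x, reservations):
    """Back-off probabilities when drops are weighted by the reservations."""
    out_weights = [max(xi / mi - 1.0, 0.0) for xi, mi in zip(x, reservations)]
    total_out = sum(out_weights)
    if total_out > 0:
        # OUT packets exist: OUT rate of each source divided by its reservation.
        return [w / total_out for w in out_weights]
    # Only IN packets (assumed rule): rate divided by the squared reservation.
    in_weights = [xi / (mi * mi) for xi, mi in zip(x, reservations)]
    total_in = sum(in_weights)
    return [w / total_in for w in in_weights]
```

With rates (1500, 600) and reservations (1000, 500), the large-reservation
source backs off with probability 5/7, compared with 5/6 under plain RIO, which
is the intended relief for connections with large reservations.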



5.4 Saw Tooth Marking Scheme 

Standard TCP, two priority levels, and RIO buffers are used. Thus,
$R_i(x_i) = x_i/2$. The difference here is in the marker operation. The flow of
connection i contains OUT packets when its rate $x_i$ exceeds $4\mu_i/3$. The
rate of its OUT packets at the moment of congestion is equal to
$x_i - 4\mu_i/3$ rather than $x_i - \mu_i$. The new expression of $p_i(X)$ is
then

$$p_i(X) = \begin{cases}
x_i/\mu & \text{if } \sum_{m=1}^{N} 1\{x_m > 4\mu_m/3\} = 0 \\
\dfrac{(x_i - 4\mu_i/3)\,1\{x_i > 4\mu_i/3\}}{\sum_{m=1}^{N} (x_m - 4\mu_m/3)\,1\{x_m > 4\mu_m/3\}} & \text{otherwise}
\end{cases}$$

5.5 Standard TCP with Three Drop Priorities

In this scheme the source makes two reservations instead of one. Denote these
reservations by $\mu_i^1$ and $\mu_i^2$ with $\mu_i^1 < \mu_i^2$. Standard TCP
is used at the source, so $R_i(x_i) = x_i/2$. Packets are marked with three
priority levels or three colors. Packets exceeding $\mu_i^2$ are marked with
low priority (red color). Those exceeding $\mu_i^1$ but not $\mu_i^2$ are
marked with medium priority (yellow color). Packets sent at a rate slower than
$\mu_i^1$ are marked with a high priority (green color).

As in the RIO case, the network starts first to drop low priority packets. This
happens when one of the sources, say i, is exceeding its upper reservation
$\mu_i^2$. If those packets do not exist, medium priority packets are dropped.
Medium priority packets exist in the network when one of the sources, say i, is
exceeding its lower reservation $\mu_i^1$. If this is not the case, the network
drops high priority packets. A connection reduces its rate with probability one
if it is transmitting alone above a certain level. It reduces its rate with
probability 0 if it is transmitting below a level and there is another
connection transmitting above the same level. In the other cases, the
probability that a connection reduces its rate is equal to the probability that
the dropped packet belongs to this connection. Similarly to the RIO case we
write,

$$p_i(X) = \begin{cases}
x_i/\mu & \text{if } \sum_{m=1}^{N} 1\{x_m > \mu_m^1\} = 0 \\
\dfrac{(x_i - \mu_i^1)\,1\{x_i > \mu_i^1\}}{\sum_{m=1}^{N} (x_m - \mu_m^1)\,1\{x_m > \mu_m^1\}} & \text{if } \sum_{m=1}^{N} 1\{x_m > \mu_m^1\} > 0 \text{ and } \sum_{m=1}^{N} 1\{x_m > \mu_m^2\} = 0 \\
\dfrac{(x_i - \mu_i^2)\,1\{x_i > \mu_i^2\}}{\sum_{m=1}^{N} (x_m - \mu_m^2)\,1\{x_m > \mu_m^2\}} & \text{otherwise}
\end{cases}$$



To compare this scheme to previous ones, $\mu_i^1$ and $\mu_i^2$ must be set as
a function of the desired throughput $\mu_i$. We looked at the saw tooth
variation of TCP rate (Figure 1). On average and in order to realize a
throughput $\mu_i$, the connection rate should vary between $2\mu_i/3$ and
$4\mu_i/3$. Thus, we give packets below $2\mu_i/3$ the highest priority,
packets between $2\mu_i/3$ and $4\mu_i/3$ the medium priority, and packets
above $4\mu_i/3$ the lowest priority. This corresponds to
$\mu_i^1 = 2\mu_i/3$ and $\mu_i^2 = 4\mu_i/3$. This scheme is compared in the
sequel to the other schemes with these particular values of the two
reservations. Other values could, however, be used.
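The three-level rule can be sketched as a cascade over the two thresholds. The
function name and its keyword arguments are hypothetical; the default factors
2/3 and 4/3 are the particular values chosen in the text, and other values
could be passed instead.

```python
def p_three_priorities(x, reservations, low=2.0 / 3.0, high=4.0 / 3.0):
    """Back-off probabilities with three drop priorities.

    The lower and upper reservations of connection i are taken to be
    low * reservations[i] and high * reservations[i].
    """
    for factor in (high, low):             # lowest priority packets go first
        excess = [max(xi - factor * mi, 0.0) for xi, mi in zip(x, reservations)]
        total = sum(excess)
        if total > 0:
            return [e / total for e in excess]
    # Only high priority packets remain: drop in proportion to the rates.
    total_rate = sum(x)
    return [xi / total_rate for xi in x]
```

With reservations of 600 each, rates (1000, 500) make the first source back
off with certainty, since it alone exceeds the upper level of 800, whereas
rates (700, 500) spread the drops over both sources in proportion to what they
send above the lower level of 400.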



6 Some Numerical Results 

We solve our model numerically for the case of two concurrent TCP connections.
We take $\mu = 1.5$ Mbps and we set TCP segments to 512 bytes. Reservations are
expressed in kbps. The receivers are supposed to acknowledge every data packet
(b = 1). First, we give the two connections the same RTT (100 ms) and we
study the performance of the different schemes under different reservations and
different subscription levels. Second, we study the impact of a difference in RTT
on the service differentiation provided by the different schemes.

Fig. 1. The saw tooth variation of TCP rate

Fig. 2. Performance comparison for a 50% total reservation

Fig. 3. Performance comparison for a 100% total reservation

Fig. 4. Performance comparison for a 150% total reservation


6.1 Impact of the Reservation 

We change the reservations of the two sources in a way that their sum is constant
and equal to $\rho\mu$; $\rho$ indicates the level of subscription. We consider
three values of $\rho$: 0.5, 1 and 1.5. For each $\rho$ and according to the
objectives in Section 2, we define a factor F that characterizes how much
connection 1 is favored with respect to connection 2. For $\rho < 1$, the
network is under-subscribed and the two sources must share the excess bandwidth
fairly. We define F as the ratio of $\bar{X}^1 - \mu_1$ and
$\bar{X}^2 - \mu_2$. The optimum scheme is the one that gives the closest F
to one. An F > 1 means that the scheme is in favor of connection 1. For
$\rho > 1$, the network is over-subscribed. The bandwidth must be shared
proportionally to the reservation. We define F in this case as being the ratio
of $\bar{X}^1/\mu_1$ and $\bar{X}^2/\mu_2$. Again, the optimum scheme is the one
that gives the closest F to one, and an F > 1 means that the scheme is in favor
of connection 1.
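The factor F can be computed as in the sketch below. The function name and the
boolean flag are hypothetical; the caller chooses which of the two definitions
applies, so the sketch takes no position on the boundary case of a 100% total
reservation. The throughput values in the example are invented for
illustration and are not results from the paper.

```python
def favor_factor(throughputs, reservations, over_subscribed):
    """Factor F for two connections; F > 1 means connection 1 is favored."""
    (t1, t2), (m1, m2) = throughputs, reservations
    if over_subscribed:
        # Proportional sharing: compare throughput per unit of reservation.
        return (t1 / m1) / (t2 / m2)
    # Under-subscription: compare the excess obtained above the reservation.
    return (t1 - m1) / (t2 - m2)
```

For instance, throughputs of (700, 500) with reservations (400, 200) give an
excess of 300 for each connection and therefore F = 1, the ideal value in the
under-subscribed case.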

In Figures 2, 3, and 4, we plot F for the three cases $\rho$ = 0.5, 1, and 1.5.
The X-axis shows the reservation of source 1. For all the schemes and as one
would expect, F converges to 1 when the reservation of source 1 moves to
that of source 2. In the under-subscription case, the original RIO scheme gives
the worst service. The connection with the small reservation achieves better
performance than that with the large reservation. The other schemes improve
the service. They give connection 2 more chance to increase its rate above its
reservation, which improves its throughput. In the over-subscription case the
situation changes. This is most visible in Figure 4. In this case, the original
RIO scheme gives better performance than the proposed schemes (except for
the Three Color scheme). The problem here is that the source with the large
reservation is almost always transmitting IN packets and cannot profit from the
high priority we give to OUT packets. The increase in the priority of OUT packets
helps the source with the small reservation, which achieves better throughput.

6.2 Impact of Round-Trip Time 

We study in this section how well the proposed schemes withstand a difference
in RTT. We suppose that the two connections are asking for the same bandwidth
($\mu_1 = \mu_2$). We set $T_2$ to 50 ms and we vary $T_1$ between 50 ms and
500 ms. Ideally, the two connections must achieve the same throughput
independently of their RTT. To quantify the impact of $T_1$, we use the
Fairness Index defined in [12]:
$FI = (\bar{X}^1 + \bar{X}^2)^2 / \left(2\left((\bar{X}^1)^2 + (\bar{X}^2)^2\right)\right)$.
This is an increasing function of fairness. It varies between 1/2 when one of
the two connections is shut down, and 1 when the two connections realize the
same throughput. We plot in Figures 5, 6, 7, and 8, FI as a function of
$T_1/T_2$ for four values of $\rho$: 0, 0.5, 1 and 1.5. The zero reservation
case corresponds to a best effort network. All the schemes achieve the same
performance (Figure 5). The fairness deteriorates as $T_1$ increases. The small
RTT connection (i.e., connection 2) gets better and better performance. A small
reservation, as for $\rho$ = 0.5, protects the long RTT connection and improves
the service. Indeed, as $T_1$ starts to increase, the throughput of connection
1 drops, but at a certain point it falls below its reservation and the
connection starts to send only high priority packets. It then becomes protected
from the other connection. The schemes other than RIO with standard TCP improve
the service further. With these schemes, the long RTT connection has more
chances to stay above its reservation. In the over-subscription case the
situation again changes. RIO with standard TCP gives better performance than
the other schemes. The reason is that giving a connection more chances to
exceed its reservation helps the connection with small RTT rather than the
connection with long RTT.
6.3 Discussion of the Results 

We summarize our results as follows: In the under-subscription case, the RIO 
scheme is more biased against large reservation connections and long RTT ones 
than the other schemes. This is not the case in the over-subscription case where 
it provides a better performance. Except for the Three Color scheme, the other 
schemes are useful in the first case. The Three Color scheme is useful in all the 
cases since the priority it gives to OUT packets is less than that of IN packets. 

Fig. 5. Fairness index for a 0% total reservation

Fig. 6. Fairness index for a 50% total reservation

Fig. 7. Fairness index for a 100% total reservation

Fig. 8. Fairness index for a 150% total reservation

The Direct Adjustment Algorithm: A TCP-Friendly Adaptation Scheme

Abstract. Many distributed multimedia applications have the ability to
adapt to fluctuations in the network conditions. By adjusting temporal
and spatial quality to available bandwidth, or manipulating the playout
time of continuous media in response to variations in delay, multimedia
flows can keep an acceptable QoS level at the end systems. In this study,
we present a scheme for adapting the transmission rate of multimedia
applications to the congestion level of the network. The scheme, called
the direct adjustment algorithm (DAA), is based on the TCP congestion
control mechanisms and relies on the end-to-end Real-time Transport
Protocol (RTP) for feedback information. Our investigations of the DAA
scheme suggest that simply relying on the TCP-throughput model
might result, under certain circumstances, in large oscillations and low
throughput. However, DAA achieves, in general, high network utilization
and low losses. Also, the scheme is shown to be fair towards
competing TCP traffic. However, with no support from the network,
long distance connections receive less than their fair share.

J. Crowcroft, J. Roberts, and M. Smirnov (Eds.): QofIS 2000, LNCS 1922, pp. 68-79, 2000.
© Springer-Verlag Berlin Heidelberg 2000

1 Introduction

While congestion controlled TCP connections carrying time insensitive FTP or
WWW traffic still constitute the major share of the Internet traffic today [1],
recently proposed real-time multimedia services such as IP-telephony and group
communication will be based on the UDP protocol. While UDP does not offer any
reliability or congestion control mechanisms, it has the advantage of not
introducing additional delays to the carried data due to retransmissions as is
the case with TCP. Additionally, as UDP does not require the receivers to send
acknowledgments for received data, UDP is well suited for multicast
communication. However, deploying non-congestion controlled UDP in the Internet
on a large scale might result in extreme unfairness towards competing TCP
connections, as TCP senders react to congestion situations by reducing their
bandwidth consumption and UDP senders do not. Therefore, UDP flows need to be
enhanced with control mechanisms that not only aim at avoiding network overload
but are also fair towards competing TCP connections, i.e., are TCP-friendly.
TCP-friendliness indicates here that if a TCP connection and an adaptive flow
with similar transmission behaviors have similar round trip delays and losses,
both connections should receive similar bandwidth shares. As an oscillating
perceived QoS is rather annoying to the user, multimedia flows require stable
bandwidth shares that do not change on the scale of a round trip time as is the
case of TCP connections. It is, thus, expected that a TCP-friendly flow would
acquire the same bandwidth share as a TCP connection only averaged over time
intervals of several seconds or even over the entire life time of the flow and
not at every time point [2].

In this paper, we propose an end-to-end rate adaptation scheme called the 
direct adjustment algorithm (DAA) for adjusting the transmission rate of mul- 
timedia applications to the congestion level of the network. DAA is based on 
a combination of two approaches described in the literature, namely: additive 
increase/multiplicative decrease (AIMD) schemes proposed in [3,4] and an en- 
hancement of the TCP-throughput model described in [5]. 

In Sec. 2, we present a general overview of some of the TCP-friendly schemes
currently proposed in the literature.

Section 3 presents an approach for adapting the transmission rate of end
systems to the network congestion state using RTP based on the AIMD approach
and discusses the unfairness of such schemes towards competing TCP connections.
The direct adjustment algorithm is presented in Sec. 4. The performance of
the scheme in terms of bandwidth utilization as well as the behavior of TCP
connections traversing the same congested links is then investigated in Sec. 5
and Sec. 6.

2 Background and Related Work 

Recently, there have been several proposals for TCP-friendly adaptation schemes
that either use control mechanisms similar to those of TCP or base the
adaptation behavior on an analytical model of TCP.

Rejaie et al. present in [6] an adaptation scheme called the rate adapta- 
tion protocol (RAP). Just as with TCP, sent packets are acknowledged by the 
receivers with losses indicated either by gaps in the sequence numbers of the 
acknowledged packets or timeouts. The sender estimates the round trip delay 
using the acknowledgment packets. If no losses were detected, the sender peri- 
odically increases its transmission rate additively as a function of the estimated 
round trip delay. After detecting a loss the rate is multiplicatively reduced by 
half in a similar manner to TCP. 

Jacobs [7] presents a scheme called the Internet-friendly protocol that uses
the congestion control mechanisms of TCP, however, without retransmitting
lost packets. In this scheme, the sender maintains a transmission window that is
advanced based on the acknowledgments of the receiver, which sends an
acknowledgment packet for each received data packet. Based on the size of the
window, the sender then estimates the appropriate transmission rate.

Padhye et al. [5] present an analytical model for the average bandwidth share
of a TCP connection ($r_{TCP}$):

$$r_{TCP} = \frac{M}{t_{RTT} \sqrt{\frac{2Dl}{3}} + t_{out} \min\left(1, 3\sqrt{\frac{3Dl}{8}}\right) l \left(1 + 32 l^2\right)} \qquad (1)$$

with M as the packet size, l as the loss fraction, $t_{out}$ as the TCP
retransmission timeout value, $t_{RTT}$ as the round trip delay and D as the
number of TCP packets acknowledged by each acknowledgment packet.
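The model can be evaluated numerically as in the sketch below. It uses the
widely quoted closed form of the Padhye et al. throughput model, whose
parameters match the variable list given here; the function and argument names
are hypothetical, and the sample values are invented for illustration.

```python
import math


def tcp_throughput(packet_size, loss, rtt, timeout, acked_per_ack=1):
    """TCP-equivalent rate, in units of packet_size per second.

    packet_size   -- M, the packet size
    loss          -- l, the loss fraction (must be positive)
    rtt           -- round trip delay in seconds
    timeout       -- TCP retransmission timeout in seconds
    acked_per_ack -- D, packets acknowledged by each acknowledgment
    """
    if loss <= 0:
        raise ValueError("the model needs a positive loss fraction")
    denom = rtt * math.sqrt(2.0 * acked_per_ack * loss / 3.0)
    denom += (timeout
              * min(1.0, 3.0 * math.sqrt(3.0 * acked_per_ack * loss / 8.0))
              * loss * (1.0 + 32.0 * loss * loss))
    return packet_size / denom
```

The rate falls as either the loss fraction or the round trip delay grows,
which is the property an adaptive sender relies on when it caps its rate at
the TCP-equivalent value.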

Using this model, Padhye et al. [8] present a scheme in which the sender
estimates the round trip delay and losses based on the receiver's
acknowledgments. In case of losses, the sender restricts its transmission rate
to the equivalent TCP rate calculated using Eqn. 1; otherwise the rate is
doubled.

Additionally, various schemes have been proposed for the case of multicast 
communication such as [9,10,11] that aim at using a TCP-friendly bandwidth 
share on all links traversed by the multicast stream. 

3 Additive Increase and Multiplicative Decrease
Adaptation Using RTP

When designing an adaptive control scheme, the following goals need to be
considered:

— to operate with a low packet loss ratio 

— achieve high overall bandwidth utilization 

— fairly distribute bandwidth between competing connections 

Fairness in this context does not only mean equal distribution of bandwidth
among the adaptive connections, but being friendly to competing TCP
traffic as well. Various adaptation algorithms proposed in the literature [3,4,12]
were shown to be efficient in terms of the first two goals of low losses and high
utilization but neglected the fairness issue. As those schemes do not consider
the bandwidth share of TCP traffic traversing the same congested links, such
algorithms might lead to the starvation of competing TCP connections.

With most of the adaptation schemes presented in the literature [6,5] the 
sender adapts its transmission behavior based on feedback messages from the 
receiver sent in short intervals in the range of one or a few round trip delays. 
This is particularly important for the case of reliable transport where the sender 
needs to retransmit lost packets. Additionally, with frequent feedback messages 
the sender can obtain up-to-date information about the round trip delay and, 
hence, use an increase in the round trip delay as an early indication of possible 
congestion. 

In contrast, in this paper we investigate adaptation schemes that use
the Real-time Transport Protocol (RTP) [13] for exchanging feedback information
about the round trip time and the losses at the receiver. As RTP is currently
being proposed as an application-level protocol for multimedia services over the
Internet, using RTP would ease the introduction of adaptation schemes in the
context of such services. RTP defines a data and a control part. For the data part
RTP specifies an additional header to be added to the data stream to identify
the sender and type of data. With the control part, called RTCP, each member of
a communication session periodically sends control reports to all other members
containing information about sent and received data. However, with RTP, the
interval between sending two RTCP messages is usually around five seconds. The
infrequency of the RTCP feedback messages means that an RTP sender cannot
react quickly enough to rapid changes in the network conditions. Thus, the
goal of RTCP-based adaptation is to adjust the sender's transmission rate to the
average available bandwidth and not to react to rapid changes in, for example,
the buffer lengths of the routers. This might actually be more appropriate in
some cases than changing the transmission rate at a high frequency.

As an example of this behavior, we tested the approach described in [4]. This 
scheme closely resembles the schemes described in [3,12,14] and is based 
on the same AIMD approach. With this approach the sender reduces its transmission 
rate by a multiplicative decrease factor after receiving feedback from the 
receiver indicating losses above a certain threshold, called the upper loss 
threshold. With losses below a second threshold, called the lower loss threshold, 
the sender can increase its transmission rate additively. If the feedback 
information indicates losses between the two thresholds, the sender 
maintains its current transmission level, see Fig. 1. Reducing the rate 
multiplicatively allows for a fairer reaction to congestion. That is, connections 
using a disproportionately large bandwidth share are forced to reduce their 
transmission rate by a larger amount. 
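The threshold rule described above can be sketched in a few lines. This is only an illustration, not the implementation of [4]: the function name, the rate unit (kb/s) and the `min_rate` floor are assumptions, while the default thresholds and factors are the ones quoted for the simulations in this section.

```python
def aimd_update(rate, loss, lower=0.05, upper=0.10,
                increase=50.0, decrease=0.875, min_rate=1.0):
    """Return the new sending rate after one receiver report.

    rate     -- current sending rate (unit assumed to be kb/s)
    loss     -- loss fraction reported by the receiver (0.0 .. 1.0)
    min_rate -- hypothetical floor so the rate never reaches zero
    """
    if loss > upper:
        # Losses above the upper threshold: multiplicative decrease.
        return max(min_rate, rate * decrease)
    if loss < lower:
        # Losses below the lower threshold: additive increase.
        return rate + increase
    # Losses between the two thresholds: keep the current rate.
    return rate


if __name__ == "__main__":
    # A large flow gives up more bandwidth than a small one for the same loss.
    for r in (1000.0, 200.0):
        print(r, "->", aimd_update(r, loss=0.12))
```

With the same loss report, a flow sending at 1000 gives up 125 units while a flow sending at 200 gives up only 25, which is the fairness argument made in the text.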



Fig. 1. AIMD algorithm (figure labels: loss (%), packet loss, low-pass filter) 

To test the behavior of TCP connections sharing the same congested links 
with connections using this scheme, we simulated the topology depicted in Fig. 2. 
One TCP connection is sharing a bottleneck router with two connections using 
the adaptation approach just described. The connection has a round trip time of 
10 msec and a bandwidth of 10 Mb/s. The TCP source is based on Reno-TCP. 
That is, the TCP source reduces its transmission window by half whenever loss 
is indicated. The adaptive connections have a lower loss threshold of 5% and 
an upper one of 10%. The additive increase factor was set to 50 kb and the 
multiplicative decrease factor to 0.875. Those values were suggested in [4] to be 
most appropriate based on measurements. The router is a random early detection 
(RED) gateway as proposed by Floyd and Jacobson [15]. A RED gateway 
detects incipient congestion by computing the average queue size. When the 
average queue size exceeds a preset minimum threshold the router drops each 
incoming packet with some probability. Exceeding a second maximum threshold 
leads to dropping all arriving packets. This approach not only keeps the average 
queue length low but ensures fairness and avoids synchronization effects. In all 
of our simulations presented in this paper the minimum drop threshold of the 
router was set to 0.5 and the maximum one to 0.95 based on results achieved 
in [2]. 
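The RED behavior summarized above can be illustrated with a minimal sketch. It is a simplification and not the full algorithm of [15]: the linear ramp between the two thresholds, the maximum drop probability `max_p`, the averaging weight and the interpretation of the thresholds as fractions of the buffer size are all assumptions made for the example.

```python
import random


class RedQueue:
    """Simplified RED gateway: thresholds are fractions of the buffer size."""

    def __init__(self, min_th=0.5, max_th=0.95, max_p=0.1, weight=0.002):
        self.min_th = min_th    # below this average occupancy nothing is dropped
        self.max_th = max_th    # at or above this average occupancy everything is dropped
        self.max_p = max_p      # assumed drop probability just below max_th
        self.weight = weight    # assumed weight of the running average
        self.avg = 0.0          # average queue occupancy (0.0 .. 1.0)

    def drop_probability(self):
        if self.avg < self.min_th:
            return 0.0
        if self.avg >= self.max_th:
            return 1.0
        # Assumed linear ramp between the two thresholds.
        return self.max_p * (self.avg - self.min_th) / (self.max_th - self.min_th)

    def on_arrival(self, occupancy, rng=random):
        """Update the average with the current occupancy; return True to drop."""
        self.avg = (1.0 - self.weight) * self.avg + self.weight * occupancy
        return rng.random() < self.drop_probability()
```

Because the decision is based on the averaged occupancy, short bursts pass through while persistent queue build-up leads to randomized early drops.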




Fig. 2. Test configuration for the interaction of the adaptive schemes and TCP 

Fig. 3. Bandwidth distribution with an AIMD adaptation scheme 



As Fig. 3 shows, the adaptive connections share the available bandwidth 
among themselves, leaving less than 5% for the TCP connection. This, however, 
was to be anticipated. With the acknowledgment scheme of TCP, senders are 
usually informed about packet losses within a time period much smaller than 
the minimal time between two RTCP packets, i.e., 5 seconds. With each loss 
notification, TCP reduces its transmission window by half. Older TCP versions, 
e.g., Tahoe-TCP, reduce it even down to one packet [16]. So, while the adaptive 
source keeps on sending with the high transmission rate until an RTCP packet 
with loss indication arrives, the TCP source reduces its transmission window 
and thereby its rate. That is, the adaptive source can for a longer time period 
send data with the rate that might have actually caused the congestion. Also, 
as TCP reduces its transmission rate the congestion level will be decreased, so 
that the adaptive source will finally have to react to a reduced congestion level. 
Finally, the adaptive scheme will only start reacting to congestion if the losses are 
larger than the loss threshold. TCP, on the contrary, reacts to any lost packets. 
Therefore, in Fig. 3 we can notice that during the first 200 seconds, the TCP 
connection reduces its rate whereas the adaptive connection actually increases 
its transmission rate. The adaptive connection only starts reacting to congestion 
after measuring a loss ratio higher than the loss threshold that was set here to 
5%. However, as the loss remains below the 10% upper threshold, the adaptive 
connections can keep their high rate. 



4 The Direct Adjustment Algorithm 

The direct adjustment algorithm combines the AIMD approach with a direct 
use of the bandwidth share a TCP connection would obtain under 
the same round trip time, packet size and loss ratio. 

During an RTP session the receiver reports in its control packets the percentage 
of lost data observed since it sent the last control packet. At the sender 
side, the RTCP packets are processed and, depending on the loss values they 
report, the sender can increase or decrease its sending rate. 
On reception of each RTCP packet the sender needs to do the following: 

- Calculate the round trip time ($\tau$) and determine the propagation delay. The 
  RTCP receiver reports include fields giving the timestamp of the last 
  report received from this sender ($T_{LSR}$) and the time elapsed between 
  receiving this report and sending the corresponding receiver report 
  ($T_{DLSR}$). Knowing the arrival time ($T$) of the RTCP packet, the end 
  system can calculate the round trip time: 

  $\tau = T - T_{DLSR} - T_{LSR}$    (2) 

- The RTCP receiver reports contain the value of the average packet loss ($l$) 
  measured for this sender in the time between sending two consecutive RTCP 
  packets at the reporting receiver. To avoid reactions to sudden loss peaks in 
  the network, the sender determines a smoothed loss ratio $l_s$: 

  $l_s = (1 - \sigma) \times l_s + \sigma \times l$    (3) 

  with $l$ as the loss value reported in the RTCP packet and $\sigma$ as a smoothing 
  factor set here to 0.3. This value was used in [4], and various simulations and 
  measurements done in [17] suggested its suitability as well. 

- Using $l_s$ and $\tau$ the sender calculates $r_{tcp}$ as in Eqn. 1. 

- For the case of $l_s > 0$ the sender sets its transmission rate to 
  $\min[r_{tcp}, r_{add}]$ with 

  $r_{add} = r + A$    (4) 

  with $A$ as the additive increase factor and $r$ as the current transmission rate. 
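The per-report processing listed above can be sketched as follows. This is a hedged illustration rather than the authors' code. Eqn. 1 is not reproduced in this section, so `simple_tcp_rate` is only a stand-in based on the well-known simplified estimate 1.22 x M / (RTT x sqrt(loss)); the function names, the 8000-bit packet size and the behavior for a smoothed loss of zero are assumptions.

```python
import math


def round_trip_time(arrival, t_lsr, t_dlsr):
    """Eq. (2): round trip time from the RTCP receiver report fields (seconds)."""
    return arrival - t_dlsr - t_lsr


def smooth_loss(prev_smoothed, reported, sigma=0.3):
    """Eq. (3): smoothed loss ratio with smoothing factor sigma."""
    return (1.0 - sigma) * prev_smoothed + sigma * reported


def simple_tcp_rate(rtt, loss, packet_bits=8000):
    """ASSUMPTION: stand-in for Eqn. 1, simplified TCP throughput in bit/s."""
    return 1.22 * packet_bits / (rtt * math.sqrt(loss))


def daa_update(rate, smoothed_loss, rtt, increase, tcp_rate=simple_tcp_rate):
    """Eq. (4): new rate is min(r_tcp, r + A) when the smoothed loss is positive."""
    r_add = rate + increase
    if smoothed_loss > 0:
        return min(tcp_rate(rtt, smoothed_loss), r_add)
    # ASSUMPTION: the text only specifies the case l_s > 0; with no observed
    # loss this sketch simply applies the additive increase.
    return r_add


if __name__ == "__main__":
    rtt = round_trip_time(arrival=10.5, t_lsr=10.0, t_dlsr=0.2)
    l_s = smooth_loss(prev_smoothed=0.0, reported=0.1)
    print(rtt, l_s, daa_update(1_000_000.0, l_s, rtt, increase=50_000.0))
```

Passing the throughput model as a parameter keeps the sketch usable with whatever form of Eqn. 1 is actually intended.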



5 Performance of the Direct Adjustment Algorithm 

As a first performance test of the direct adjustment algorithm we used the 
topology depicted in Fig. 4. The model consists of 15 connections sharing a bottleneck 
router. All connections use the direct adjustment algorithm and are persistent 
sources. That is, they always have data to send at the maximum allowed rate. 
They all have the same round trip times and are similar in their requirements 
and characteristics. The routers used RED for buffer management. In all of our 
simulations, we set the packet size to 1 kbyte, which is a typical video packet size. 
Tab. 1 depicts the average utilization results achieved for different round trip 
times and different link bandwidths. Fig. 6 shows the average bandwidths the 
single connections achieved for simulation runs of 1000 seconds. Fig. 5 describes 
the average rates achieved for each flow under different simulation parameters. 
The similar values of the bandwidth shares indicate the fairness of DAA among 
competing DAA flows. The results shown in Tab. 1 as well as Fig. 5 reveal that 
with the direct adjustment algorithm a bandwidth utilization between 60% and 
99% is possible and that the bandwidth is equally distributed among all 
connections sharing the same link. Note also in Fig. 6 that, while connection number 
15 starts 100 seconds later than the other connections, it soon reaches the same 
bandwidth level. 

Fig. 4. DAA performance testing topology 

Table 1. Link utilization with the direct adjustment algorithm for different 
propagation delays ($\tau$) and link rates ($\alpha$) 

Fig. 5. Rate of the single connections of DAA under different round trip times 
($\tau$) and link bandwidths ($\alpha$) 



The variance in the achieved utilization results from the oscillatory behavior 
of the scheme. As Fig. 6 shows, the transmission rate of the senders varies in 
the range of ±30% of the average rate. These oscillations result in part from the 
AIMD approach and in part from the partial incorrectness of the TCP model. 
AIMD schemes are inherently oscillatory. That is, such systems do not con- 
verge to an equilibrium but oscillate around the optimal state [18]. Also, the 
TCP-model we are using for determining the maximum allowed transmission 
rate does not consider the bandwidth available on the network or the number of 
connections sharing a link. It may therefore yield throughput values much 
higher than what is available on a link, and in other cases may underestimate the 
appropriate bandwidth share each connection should be using. This is especially 
evident in Fig. 6(d). Due to the high propagation delay, loss values as low as 
0.1% result in a TCP throughput of only 1.5 Mb/s, whereas each connection 
should have been using a share of 6 Mb/s. So, as the TCP model does not take 
the available bandwidth into account, these oscillations cannot be prevented. In 
addition, note that the adjustment decisions are made here based on the loss 
and delay values reported in the RTCP packets. The TCP-throughput model is, 
however, based on considering the loss and delay values measured throughout 
the lifetime of a connection. 

Fig. 6. Temporal behavior of the direct adjustment algorithm under different 
round trip times ($\tau$) and link bandwidths ($\alpha$) 



6 TCP and the Direct Adjustment Algorithm 

When designing the direct adjustment algorithm we aimed not only at achiev- 
ing high utilization and low losses but also wanted the adaptation to be TCP- 
friendly. 

Fig. 7 shows the bandwidth distribution achieved with the direct adjustment 
algorithm using the topology described in Fig. 2. While using the adaptation 
scheme presented in [4] leads to the starvation of the TCP connection, see 
Fig. 3, using the direct adjustment algorithm results in a nearly equal bandwidth 
distribution: the TCP connection gets around 30% of the available bandwidth 
and the two adaptive connections each get 35%. This slight unfairness might be 
explained by the long control intervals of the RTCP protocol. As the adaptive 
senders update their transmission rates only every few seconds, they might keep 
on sending at a high rate during congestion periods until a control message 
indicating losses arrives. During this time, the TCP connection reduces its 
transmission window and thereby reduces the congestion. Hence, the RTCP 
messages indicate a reduced congestion state. 

Fig. 7. Bandwidth distribution with the direct adjustment algorithm and TCP 

7 Summary and Future Work 

In this paper, we presented a new approach for dynamically adjusting the sending 
rate of applications to the congestion level observed in the network. This adjust- 
ment is done in a TCP-friendly way based on an enhanced TCP-throughput 
model. The senders can increase their sending rate during underload situations 
and then reduce it during overload periods. We have run various simulations 
investigating the performance of the scheme, its effects on TCP connections 
sharing the same bottleneck and its fairness. 

In terms of TCP-friendliness, DAA performs better than schemes based solely 
on the AIMD approach [4]. However, the results presented here suggest that the 
adaptation approach used for DAA describes more of “how not to” realize congestion 
control for multimedia communication than “how to”. Using the TCP model 
to periodically set the transmission rate results in a highly oscillatory transmission 
behavior, which is not suitable for multimedia streams. Because the network 
congestion state varies rapidly, observing losses and delays for short periods of 
time and feeding those values into the TCP model to estimate the TCP-friendly 
bandwidth share yields strongly oscillating values that often do not reflect the 
actual resource availability in the network. To achieve a more stable behavior, 
the network congestion state needs to be observed over moving windows of several 
seconds. Work done based on such an approach [19] confirms these observations. 

Another major point that needs to be considered here is the value to use 
for the additive increase factor (A). In our simulations, we set A by 
running different simulations and choosing the value that gave the most 
appropriate results. In order 
to cover a wide range of possible communication scenarios, the increase rate 
needs to be set dynamically in a way that is suited to the number of competing 
flows, the available resources and network load. Better tuned additive increase 
factors might also lead to a more stable bandwidth distribution. 

Abstract. ACK filtering has been proposed as a technique to alleviate 
the congestion on the reverse path of a TCP connection. In the literature 
the case of one ACK per connection at a time in the buffer at the input 
of a slow channel has been studied. In this paper we show that this is 
too aggressive for short transfers. We first study static filtering, where 
a certain ACK queue length is allowed. We show analytically how this 
length needs to be chosen. We then present some algorithms that adapt 
the filtering of ACKs as a function of the slow channel utilization rather 
than the ACK queue length. 



1 Introduction 

Accessing the Internet via asymmetric paths is becoming common with the intro- 
duction of satellite and cable networks. Users download data from the Internet 
via a high speed link (e.g., a satellite link at 2 Mbps) and send requests and 
acknowledgements (ACK) via a slow reverse channel (e.g., a dial-up modem line 
at 64 kbps). Figure 1 shows an example of such an asymmetric path. It has been 
shown [1,3,5,8,13] that the slowness of the reverse channel limits the throughput 
of TCP transfers [10,15] running in the forward direction. A queue of ACKs 
builds up in the buffer at the input of the reverse channel (we call it the reverse 
buffer) causing an increase in the round trip time (RTT) and an overflow of the 
buffer (i.e., loss of ACKs). This results in a performance deterioration for many 
reasons. First, an increase in RTT reduces the throughput since the transmission 
rate of a TCP connection is equal at any moment to the window size divided by 
RTT. Second, the increase in RTT as well as the loss of ACKs slows the win- 
dow increase. Third, the loss of ACKs results in gaps in the ACK clock which 
leads to burstiness at the source. Fourth, the loss of ACKs reduces the capacity 
of TCP (especially Reno) to recover from losses without timeout [9]. Another 
problem has also been reported in the case of multiple connections contending for 
the reverse channel: the deadlock of a new connection sharing 
the reverse channel with already running connections [13]. Due to the overflow 
of the reverse buffer, the new connection suffers from the loss of its first ACKs, 
which prevents it from increasing its window. This deadlock continues until the 
dominant connections reduce their rates. We can see this latter problem as a 
result of unfairness in the distribution of the slow channel bandwidth between 
the different flows. The main reason for such unfairness is that a flow of ACKs 
is not responsive to drops in the way a TCP data flow is. 

* A detailed version of this paper is available as an INRIA Research Report at 
http://www.inria.fr/RRRT/RR-3908.html 

Fig. 1. An asymmetric path 

Many solutions have been proposed for this problem of bandwidth asymmetry. 
Except for header compression (e.g., the SLIP algorithm in [11]), 
which reduces the size of ACKs in order to increase the rate of the 
reverse channel in terms of ACKs per unit of time, these solutions try to 
match the rate of ACKs to the rate of the reverse channel. The match can be 
achieved either by delaying ACKs at the destination [3,8] or by filtering them in the 
reverse buffer [1,3]. Adaptive delaying of ACKs at the destination, also called 
ACK congestion control [3], requires the implementation of new mechanisms at 
the receiver together with some feedback from the network. The idea is to adapt 
the generation rate of ACKs as a function of the reverse buffer occupancy. ACK 
filtering, in contrast, requires modification only at the reverse buffer. It exploits 
the cumulative nature of ACKs: an ACK can be safely replaced by a subsequent 
ACK carrying a larger sequence number. As far as the content of ACKs is 
concerned, there is no need to queue ACKs in the reverse buffer. Thus, when an 
ACK arrives at the reverse channel, the reverse buffer is scanned for ACKs 
from the same connection and some (or all) of these ACKs are erased. The buffer 
occupancy is thereby kept at a low level. 
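A minimal sketch of this filtering step, for the variant in which all superseded ACKs of a connection are erased, might look as follows. The class name, the `(conn_id, ack_no)` representation of an ACK and the choice to keep queued ACKs with a larger number than the arriving one are assumptions made for the illustration.

```python
from collections import deque


class AckFilterBuffer:
    """Reverse buffer that keeps at most the newest cumulative ACK per connection."""

    def __init__(self):
        self.queue = deque()  # entries are (conn_id, ack_no) in arrival order

    def enqueue(self, conn_id, ack_no):
        # Scan the buffer and erase queued ACKs of the same connection that the
        # new cumulative ACK supersedes (same connection, not a larger number).
        self.queue = deque(
            (c, a) for (c, a) in self.queue if c != conn_id or a > ack_no
        )
        self.queue.append((conn_id, ack_no))

    def dequeue(self):
        """Return the next ACK to send on the slow channel, or None if empty."""
        return self.queue.popleft() if self.queue else None


if __name__ == "__main__":
    buf = AckFilterBuffer()
    for conn, ack in [("A", 1), ("B", 5), ("A", 3)]:
        buf.enqueue(conn, ack)
    print(list(buf.queue))  # ACK 1 of connection A has been replaced by ACK 3
```

ACKs of other connections are left untouched, so the scan only trades information that a later cumulative ACK already carries.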

In this paper we ask how many ACKs from a connection 
we must queue in the reverse buffer before we start filtering. In the literature 
the case of one ACK per connection at a time has been studied [1,3]. When 
an ACK arrives, and before it is queued, the reverse buffer erases any ACK 
from the same connection. Clearly, this behavior optimizes the end-to-end delay 
and the queue length, but it ignores the fact that TCP uses ACKs to increase 
its window. This may not have an impact on the congestion avoidance phase. 
However, it certainly has an impact on slow start, where the window is small 
and needs to be increased as quickly as possible to achieve good performance. The 
impact on slow start comes from the fact that TCP is known to be bursty during 
this phase [2,4,12], and thus ACKs arrive in bursts at the reverse buffer. ACKs 
may also arrive in bursts due to some compression of data packets or ACKs 
in the network. Filtering bursts of ACKs will result in few ACKs reaching the 
source and then in a slow window increase. This negative impact of ACK filtering 
will be important on short transfers which dominate most of today’s Internet 
traffic [6] and where most of the transfer is done during slow start. In particular, 
it will be pronounced over long delay links (e.g., satellite links), where slow start 
is already slow [5]. Allowing some number of ACKs from a connection 
to be queued in the buffer before filtering starts has the advantage 
of absorbing these bursts of ACKs, which results in a faster window increase. 
However, this threshold must be kept small in order to limit the end-to-end 
delay. A tradeoff then appears: one should expect an improvement 
in the performance as the threshold increases, followed by a deterioration 
when it becomes large (see Figure 6 for an example). 

We first study the case in which the ACK filtering threshold (the number of 
ACKs that a connection can queue) is set to a fixed value. We show analytically 
how this threshold must be chosen. We then present our algorithm, which we call 
Delayed ACK Filtering, that adapts the filtering of ACKs as a function of the 
slow channel utilization rather than the ACK queue length. This is equivalent 
to a dynamic setting of the ACK filtering threshold. The objective is to pass as 
many ACKs as possible to the source while maintaining the end-to-end delay at 
small values. In case of many connections, our algorithm adapts the filtering in a 
way to share fairly the slow channel bandwidth between the different connections. 



2 Impact of ACK Filtering Threshold 

We focus on short transfers, which are considerably affected by the slow start 
phase. We first consider the case of a single transfer and take into account the 
burstiness of ACKs caused by slow start itself. Our objective is to show that delaying 
the filtering until a certain number of ACKs are queued shortens the slow start 
phase and improves the performance if this threshold is correctly set. We assume 
in the sequel that router buffers in the forward direction are large enough to 
absorb the burstiness of traffic resulting from the filtering of ACKs. We 
do not address the problem of burstiness of traffic in the forward direction 
further, since our algorithms reduce this burstiness compared to the classical 
one-ACK-at-a-time filtering strategy. 



2.1 TCP and Network Model 

Let $\mu_r$ be the bandwidth available on the reverse path and let $T$ be the constant 
component of RTT (in absence of queueing delay). $\mu_r$ is measured in terms of 
ACKs per unit of time. Assume that RTT increases only when ACKs are queued 
in the reverse buffer. This happens when the reverse channel is fully utilized, 
which in turn happens when the number of ACKs arriving at the reverse buffer 
per $T$ is more than $\mu_r T$. In the case of no queueing in the reverse buffer, RTT is 
taken equal to $T$. This assumption holds given the considerable slowness of the 
reverse channel with respect to the other links on the path. 

Assume for the moment that the reverse buffer is large and that ACKs are 
not filtered. The window at the sender then grows exponentially at a rate that 
depends on the frequency at which the receiver acknowledges packets. Recall 
that we are working in the slow start mode, where the window is increased by 
one packet upon every ACK arrival [10]. Suppose that the receiver acknowledges 
every $d$ packets; the window then increases by a factor $\alpha = 1 + 1/d$ every RTT. 
Note that most TCP implementations acknowledge every other data packet 
($d = 2$) [15]. Denote by $W(n)$ the congestion window size at the end of the $n$th 
RTT. It follows that $W(n+1) = (d+1)W(n)/d = \alpha W(n)$. For $W(0) = 1$, this 
gives $W(n) = \alpha^n$, which shows the exponential increase. 
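As a small numerical illustration of this recurrence (a sketch under the idealized assumptions of the model, with fractional windows allowed; the function name is hypothetical), the window after n round trips can be computed by applying the per-round-trip growth step by step:

```python
def slow_start_window(n, d=2, w0=1.0):
    """Window after n round trips when the receiver acknowledges every d packets."""
    w = w0
    for _ in range(n):
        # W/d ACKs arrive during the round trip, each adding one packet.
        w += w / d
    return w


if __name__ == "__main__":
    for d in (1, 2):
        print(d, [slow_start_window(n, d) for n in range(5)])
```

With d = 1 the window doubles every round trip, while with d = 2 it grows by a factor of 1.5, which is why the acknowledgement strategy matters for short transfers.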

Once the reverse channel is fully utilized, ACKs start to arrive at the source 
at a constant rate $\mu_r$. Here, the window together with RTT start to increase 
linearly with time. The transmission rate, which is equal to the window size 
divided by RTT, stops increasing and becomes limited by the reverse channel. 
This continues until ACKs start to be filtered or dropped. RTT then stops 
increasing and the transmission rate resumes its increase with the window size 
(see Figure 5). The first remark we can make here is that the ACK queue length 
needs to be kept small in order to get a small RTT and a 
better performance. An aggressive filtering (say one ACK per connection) is 
then needed. But, due to the fast window increase during slow start, ACKs may 
arrive at the reverse buffer in separate bursts during which the rate is higher 
than $\mu_r$, without the average rate being higher than $\mu_r$ (see [4] for a description 
of slow start burstiness). An aggressive filtering will reduce the number of ACKs 
reaching the source, whereas these bursts can be absorbed without causing any 
increase in RTT. Such absorption results in a faster window increase. Given 
that RTT remains constant whenever the reverse channel is not fully utilized, a 
faster window increase results in a faster transmission rate increase and thus in a 
better performance. The general guideline for ACK filtering is to accept all ACKs 
until the slow channel becomes fully utilized, and then to filter them in order to 
limit the RTT. 

We first consider the case in which a connection is allowed to queue a 
certain number of ACKs in the reverse buffer. This number, which we denote by $\delta$ 
and which we call the ACK filtering threshold, is kept constant during the 
connection lifetime. We study the impact of $\delta$ on the performance and we show 
how it must be chosen. Later, we present algorithms that adapt ACK filtering as 
a function of the slow channel utilization. This permits a simpler implementation 
together with a better performance than fixing a certain number of ACKs. 

2.2 ACK Filtering Threshold 

During slow start, TCP is known to transmit packets in long bursts [4]. A burst 
of $W(n)$ packets is transmitted at the beginning of round trip $n$. It causes the 
generation of $W(n)/d$ ACKs, which reach the source at the end of the RTT 
(Figure 2). Given that the reverse channel is the bottleneck and due to the 
window increase at the source, bursts of ACKs can be assumed to have a rate 
$\mu_r$ at the output of the reverse channel and a rate $\alpha\mu_r$ at its input. During the 
receipt of a long burst of ACKs, a queue builds up in the reverse buffer at a rate 
$(\alpha - 1)\mu_r$. A long burst of $X$ ACKs at a rate $\alpha\mu_r$ therefore builds a queue 
of length $X/(d+1)$ ACKs. The full utilization of the reverse channel requires 
the receipt of a long burst of ACKs of length $\mu_r T$ and then the absorption of 
a queue of length $\mu_r T/(d+1)$. This is the value of $\delta_0$, the optimum filtering 
threshold that the buffer must use: 

$\delta_0 = \mu_r T/(d+1)$    (1) 
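Equation (1) is easy to evaluate once the reverse channel rate is expressed in ACKs per unit of time. The sketch below is only an illustration: the function names are hypothetical and the 40-byte ACK size used to convert a bit rate into an ACK rate is an assumption, not a value given in the text.

```python
def acks_per_second(link_bps, ack_bytes=40):
    """Reverse channel rate in ACKs/s (ASSUMPTION: 40-byte ACKs)."""
    return link_bps / (8 * ack_bytes)


def optimal_threshold(mu_r, T, d):
    """Equation (1): optimum filtering threshold in ACKs."""
    return mu_r * T / (d + 1)


if __name__ == "__main__":
    mu_r = acks_per_second(100_000)              # 100 kbps reverse channel
    print(optimal_threshold(mu_r, T=0.2, d=1))   # T = 200 ms, every packet ACKed
```

Under the assumed ACK size, a 100 kbps channel with T = 200 ms and d = 1 gives a threshold of about 31 ACKs, which is in line with the value of 30 ACKs quoted for the simulation scenario of Section 2.4.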



2.3 Early ACK Filtering 

We now consider the case when $\delta < \delta_0$. When an ACK finds $\delta$ ACKs in the 
buffer, the most recently received ACK is erased and the new ACK is queued in 
its place. We call this a static filtering strategy since $\delta$ is not changed during the 
connection lifetime. Let us study the impact of $\delta$ on the window increase rate. 
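For a single connection, the static filtering rule just described can be sketched as follows. The function name and the use of plain sequence numbers to represent ACKs are assumptions made for the illustration.

```python
from collections import deque


def enqueue_static(queue, ack_no, delta):
    """Queue an ACK under a static filtering threshold of delta ACKs."""
    if len(queue) >= delta:
        # The buffer already holds delta ACKs: erase the most recently
        # received one and put the new ACK in its place.
        queue.pop()
    queue.append(ack_no)


if __name__ == "__main__":
    q = deque()
    for ack in range(1, 6):
        enqueue_static(q, ack, delta=3)
    print(list(q))  # the tail entry is always the newest cumulative ACK
```

The queue never grows beyond delta entries, and since ACKs are cumulative the newest one at the tail carries the information of the ACKs it replaced.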

Starting from $W(0) = 1$, the window increases exponentially until round 
trip $n_0$, where ACKs start to be filtered. This happens when the ACK queue 
length reaches $\delta$, which in turn happens when the reverse buffer receives a long 
burst of length $\delta(d+1)$ ACKs at a rate $\alpha\mu_r$. Given that the length of the long 
burst of ACKs received during round trip $n_0$ is equal to $W(n_0)/d$, we write 
$W(n_0) = \alpha^{n_0} = \delta d(d+1)$. 

After round trip $n_0$, ACKs start to be filtered and the window increase slows. 
To show the impact of $\delta$ on the window increase, we define the following variables: 
Consider $n > n_0$ and put ourselves in the region where the slow channel is not 
fully utilized. We know that the maximum window increase rate ($\mu_r$ packets per 
unit of time) is achieved when the reverse channel is fully utilized, and the best 
performance is obtained when we reach the full utilization as soon as possible. 
Let $N(n)$ denote the number of ACKs that leave the slow channel during round 
trip $n$. Given that we are in slow start, we have $W(n+1) = W(n) + N(n)$. 

The burst of data packets of length $W(n+1)$ generates $W(n+1)/d$ ACKs 
at the destination, which reach the reverse buffer at a rate faster than $\mu_r$. The 
duration of this burst of ACKs is equal to the duration of the burst $N(n)$ at the 
output of the slow channel in the previous round trip. Recall that we are working 
in a case where the bandwidth available in the forward direction is very large 
compared to the rate of the slow reverse channel, so that many packets can be 
transmitted at the source between the receipt of two ACKs. During the receipt 
of the $W(n+1)/d$ ACKs, a queue of $\delta$ ACKs is formed in the reverse buffer and 
the slow channel transmits $N(n)$ ACKs. The $\delta$ ACKs stored in the reverse buffer 
are then sent. Thus, $N(n+1) = N(n) + \delta$. Figure 3 
shows the occupancy of the reverse buffer as a function of time after the start 
of ACK filtering and before the full utilization of the slow reverse channel. We 
can write for $n > n_0$, 

$$
\begin{aligned}
N(n) &= N(n-1) + \delta = N(n_0) + (n - n_0)\delta \\
     &= W(n_0)/d + (n - n_0)\delta = \delta(d + 1 + n - n_0) \\
W(n) &= W(n-1) + N(n-1) = W(n_0) + \sum_{k=n_0}^{n-1} N(k) \\
     &= \delta\left[d(d+1) + (n - n_0)(d+1) + (n - n_0)(n - n_0 - 1)/2\right]
\end{aligned}
$$

We remark that, due to the small value of $\delta$, the window increase changes from 
exponential to polynomial, and it becomes slower as $\delta$ decreases. The source spends more time 
before reaching the maximum window increase rate of $\mu_r$ packets per unit of 
time. 
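The closed form for the window can be checked against the two recurrences it was derived from. The sketch below is an illustration only: the function names are hypothetical and the starting round trip is treated as a given integer, although in the model it follows from the threshold.

```python
def window_closed_form(n, n0, d, delta):
    """W(n) for n >= n0 from the closed-form expression."""
    k = n - n0
    return delta * (d * (d + 1) + k * (d + 1) + k * (k - 1) / 2)


def window_recursive(n, n0, d, delta):
    """W(n) for n >= n0 from W(k+1) = W(k) + N(k) and N(k+1) = N(k) + delta."""
    w = delta * d * (d + 1)   # W(n0)
    acks = delta * (d + 1)    # N(n0) = W(n0) / d
    for _ in range(n0, n):
        w += acks
        acks += delta
    return w


if __name__ == "__main__":
    for n in range(4, 10):
        print(n, window_closed_form(n, 4, 1, 3), window_recursive(n, 4, 1, 3))
```

Both functions grow quadratically in the number of round trips after filtering starts, in contrast to the exponential growth before it, and the growth scales linearly with the threshold.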

2.4 Simulation 

Consider a simulation scenario where a TCP Reno source transmits packets of 
size 1000 bytes over a forward link of 10 Mbps to a destination that acknowledges 
every data packet ($d = 1$). The forward buffer, the receiver window, as well as 
the slow start threshold at the beginning of the connection, are set to high 
values. ACKs cross a slow channel of 100 kbps back to the source. $T$ is set 
to 200 ms. We use the ns simulator [14] and we monitor the packets sent during 
the slow start phase at the beginning of the connection. We implement an ACK 
filtering strategy that limits the number of ACKs in the reverse buffer to $\delta$, and 
we compare the performance for different values of $\delta$. The reverse buffer itself 
is set to a large value. We provide three figures where we plot, as a function of 
time, the window size (Figure 4), the transmission rate (Figure 5), and the last 
acknowledged sequence number (Figure 6). The transmission rate is averaged 
over intervals of 200 ms. 

For such a scenario the calculation gives a $\delta_0$ equal to 30 ACKs (Equation (1)). 
A $\delta$ less than $\delta_0$ slows the window growth. We see this for $\delta = 1$ and $\delta = 3$ in 
Figure 4. The other values of $\delta$ give the same window increase, given that the 
filtering starts after the full utilization of the reverse channel, and thus the 
maximum window increase rate is reached at the same time. But the window 
curve is not sufficient to study the performance, since the same window may mean 
different performance if RTT is not the same. We must also look at the plot of 
the transmission rate (Figure 5). For small $\delta$, RTT can be considered as always 
constant and the curves for the transmission rate and for the window have the 
same form. However, the transmission rate saturates when it reaches the available 
bandwidth in the forward direction (at 1000 kbps). It also saturates somewhere 
in the middle (e.g., at 3 s for $\delta = 1$). This latter saturation corresponds to the time 
between the full utilization of the reverse channel and the convergence of RTT to 
its limit ($T + \delta/\mu_r$), when $\delta$ ACKs start to be always present in the reverse buffer. 
During the convergence, the transmission rate remains constant due to a linear 
increase of both the window and RTT. Once the RTT stabilizes, the transmission 
rate resumes its increase with the window. Note that the stabilization of RTT 
takes a long time for large $\delta$ (around 5 s for $\delta = 500$). Now, Figure 6 is an 
indication of the overall performance. We clearly see that taking $\delta = \delta_0$ leads to 
the best performance, since it gives a good compromise between delaying the 
filtering to improve the window increase and bringing it forward to reduce the 
RTT. As $\delta$ increases, the overall performance improves until $\delta = \delta_0$ and then 
worsens. This conclusion may not be valid for long transfers, where slow start 
has only a slight impact on the overall performance. 

Fig. 4. TCP window vs. time 

Fig. 5. Transmission rate vs. time 



3 Delayed Filtering: Case of a Single Connection 

Tracking the queue length for filtering does not guarantee good performance. 
First, in reality and due to the fluctuation of the exogenous traffic, the arrival 
of ACKs may be completely different from the theoretical arrival we described. 
Second, the calculation of the optimum threshold (Equation (1)) is difficult since 
it requires knowledge of RTT and the acknowledgement strategy ($d$). Third, 
setting the filtering threshold to a fixed value leads to an unnecessary increase 
in RTT in the steady state. Some mechanisms must be implemented in the 
reverse buffer to absorb the bursts of ACKs when the slow channel is not well 
utilized, and to filter ACKs with a small $\delta$ otherwise. For the first and second 
problems, one could set the filtering threshold to the bandwidth-delay 
product of the return path ($\mu_r T$) in order to account for the most bursty case. 
The price to pay here is a further increase in RTT in the steady state. 

The simplest solution is to measure the rate of ACKs at the output of the 
reverse buffer and to compare it to the channel bandwidth. Measuring the rate 
at the output rather than at the input is better since ACKs are spread over 
time which increases the precision of the measurement tool. In case we do not 
know the channel bandwidth (e.g., case of a shared medium), one can measure 
how frequently ACKs are present in the reverse buffer. When the measurement 
indicates that the channel is fully utilized (e.g., the utilization exceeds a certain 
threshold that we fix to 90% in our simulations), we start to apply the classical 
filtering studied in the literature [3]: erase all old ACKs when a new ACK arrives. 
Once the utilization drops below a certain threshold, filtering is halted until the 
utilization increases again. This guarantees a maximum window increase during 
slow start and a minimum RTT in the steady state. We can see it as a dynamic 
filtering where δ is set to infinity when the slow channel is under-utilized and 
to 1 when it is well utilized. Also, it can be seen as a transformation of the rate 
of the input flow of ACKs from λ to min(λ, μ_r) without loss of information, 
provided that the reverse buffer is large enough to absorb bursts of ACKs before 
the start of filtering. Recall that we are always working in the case of a single 
connection. The case of multiple concurrent connections is studied later. 
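
The behaviour described above for a single connection can be sketched as follows. This is only an illustrative Python sketch, not the simulated implementation: the class name, the method names and the way the utilization is passed in are assumptions, and only the 90% threshold comes from the text.

```python
from collections import deque


class DelayedFilterBuffer:
    """Reverse buffer that filters ACKs only while the slow channel is well utilized."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold  # utilization above which filtering is applied
        self.queue = deque()        # queued ACK numbers, oldest first

    def enqueue(self, ack, utilization):
        """Queue a new ACK given the measured channel utilization (0..1)."""
        if utilization > self.threshold:
            # Channel fully utilized: classical filtering, i.e. erase all
            # old ACKs so that only the newest one remains queued.
            self.queue.clear()
        # Otherwise the burst of ACKs is simply absorbed by the buffer.
        self.queue.append(ack)

    def dequeue(self):
        """Return the oldest queued ACK, or None if the buffer is empty."""
        return self.queue.popleft() if self.queue else None


buf = DelayedFilterBuffer()
for ack in (1, 2, 3):
    buf.enqueue(ack, utilization=0.5)   # under-utilized: burst is absorbed
buf.enqueue(4, utilization=0.95)        # well utilized: old ACKs are erased
```

In this sketch the utilization is supplied by the caller; in the scheme above it would come from the rate measurement at the output of the reverse buffer.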



3.1 Utilization Measurement 

We assume that the slow channel bandwidth is known. The time sliding window 
(TSW) algorithm defined in [7] is used for ACK rate measurement. When a new 
ACK leaves the buffer, the time between this ACK and the last one is measured 
and the average rate is updated by taking a part of this new measurement and the 
rest from the past. The difference from classical low-pass filters is that the decay 
time of the rate with the TSW algorithm is a function of the current average 
rate, not only of the frequency of measurements. The coefficient of contribution 
of the new measurement is more important at low rates than at high rates. 
This guarantees a fast convergence in case of low rates. The decay time of the 
algorithm is controlled via a time window that decides how much the past is 
important. The algorithm is defined as follows. Let Rate be the average rate, 
Window be the time window, Last be the time when the last ACK was seen, 
Now be the current time, and Size be the size of the packet (40 bytes for ACKs). 
Then, upon ACK departure, 

1) Volume = Rate*Window + Size; 
2) Time = Now - Last + Window; 
3) Rate = Volume/Time; 
4) Last = Now; 

Fig. 8. Reverse buffer occupancy for static filtering 

Fig. 9. Reverse buffer occupancy for delayed filtering 

The same algorithm with a slight change can be applied to measure how fre- 
quently ACKs are present in the reverse buffer. 
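
The four TSW steps of this section translate almost directly into code. The following Python sketch is illustrative only: the class and method names are assumptions, and the initial rate of zero is a choice made here, not something stated in the text.

```python
class TSWRateEstimator:
    """Time sliding window (TSW) estimate of the ACK rate in bytes per second."""

    def __init__(self, window, size=40, now=0.0):
        self.window = window  # time window (seconds): how much the past counts
        self.size = size      # packet size in bytes (40 bytes for ACKs)
        self.rate = 0.0       # average rate, assumed to start at zero
        self.last = now       # time at which the last ACK was seen

    def on_departure(self, now):
        """Update the average rate when an ACK leaves the buffer at time `now`."""
        volume = self.rate * self.window + self.size   # step 1
        time = now - self.last + self.window           # step 2
        self.rate = volume / time                      # step 3
        self.last = now                                # step 4
        return self.rate


est = TSWRateEstimator(window=1.0)
for i in range(1, 2001):
    est.on_departure(i * 0.01)   # one 40-byte ACK every 10 ms
# The estimate converges towards 40 / 0.01 = 4000 bytes/s.
```

Dividing the estimated rate (converted to bits per second) by the known channel bandwidth gives the utilization that is compared with the filtering threshold.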

3.2 Simulation 

We consider the same simulation scenario. We implement delayed filtering in the 
reverse buffer and we plot its performance. The time window is taken to be of the 
same order as the RTT. Figure 7 shows that our algorithm performs as well as 
the case where δ is correctly chosen. Moreover, it slightly outperforms 
static filtering with δ = δ₀ due to the erasing of all ACKs once the slow channel 
is fully utilized. The behavior of delayed filtering can be clearly seen in Figures 8 
and 9 where we plot the reverse buffer occupancy. These figures correspond to 
static filtering with δ = δ₀ = 30 and delayed filtering respectively. Bursts of 
ACKs are absorbed at the beginning before being filtered some time later. The 
filtering starts approximately at the same time in the two cases. 

4 Delayed Filtering: Case of Multiple Connections 

Consider now the case of multiple connections running in the same direction 
and sharing the slow reverse channel for their ACKs. Let N be the number of 
connections. In addition to the problems of end-to-end delay and reverse channel 
utilization, the existence of multiple connections raises the problem of fairness 
in sharing the reverse channel bandwidth. A new connection generating ACKs 
at less than its fair share of the bandwidth must be protected from other 
connections exceeding their fair share. We consider in this paper max-min fairness, 
where the slow channel bandwidth needs to be equally distributed between the 
different ACK flows. However, other fairness schemes could be studied. As an 
example, one could give ACKs from a new connection more bandwidth 
than ACKs from already running connections. We study the performance of 
different filtering strategies. We consider first the case of a large reverse buffer 
where ACKs are not lost but queued to be served later. There is no need here for 
an ACK drop strategy but rather for a filtering strategy that limits the queue 
length, that improves the utilization, and that provides a good fairness. Second, 
we study the case where the reverse buffer is small and where ACK filtering 
is not enough to maintain the queue length at less than the buffer size. ACK 
filtering needs to be extended in this second case by an ACK drop policy. 

4.1 Case of a Large Buffer 

To guarantee a certain fairness, we apply delayed filtering to every connection. A 
list of active connections is maintained in the reverse buffer. For every connection 
we store the average rate of its ACKs at the output of the reverse buffer. When 
an ACK arrives, the list of connections is checked. If no entry is found, a new 
entry is created for this connection. Now, if the average rate associated with the 
connection of the new ACK exceeds the slow channel bandwidth divided by 
the number of active connections, all ACKs belonging to the same connection 
are filtered and the new ACK is queued in the place of the oldest ACK. When 
an ACK leaves the buffer, the average rates of the different connections are 
updated. A TSW algorithm is again used for ACK rate measurement. The entry 
corresponding to a connection is freed once its average rate falls below a certain 
threshold. 
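
A minimal Python sketch of this per-connection scheme with FIFO service is given below. It is not the ns implementation: all names are hypothetical, the value of the freeing threshold is arbitrary, and two points are assumptions made here because the text does not fix them: the rates of connections other than the departing one are updated with a zero-size sample, and an entry is only freed when the connection has no queued ACK.

```python
class PerConnectionFilter:
    """Per-connection delayed filtering in a FIFO reverse buffer (sketch)."""

    def __init__(self, bandwidth, window, ack_size=40, free_below=1.0):
        self.bandwidth = bandwidth    # slow channel bandwidth, bytes/s
        self.window = window          # TSW time window, seconds
        self.ack_size = ack_size      # bytes per ACK
        self.free_below = free_below  # assumed rate threshold for freeing an entry
        self.queue = []               # FIFO list of (connection, ack) pairs
        self.rate = {}                # connection -> average output rate, bytes/s
        self.last = {}                # connection -> time of its last rate update

    def on_arrival(self, conn, ack, now):
        if conn not in self.rate:     # create an entry for a new connection
            self.rate[conn], self.last[conn] = 0.0, now
        fair_share = self.bandwidth / len(self.rate)
        mine = [i for i, (c, _) in enumerate(self.queue) if c == conn]
        if self.rate[conn] > fair_share and mine:
            # Filter: the new ACK takes the place of the oldest queued ACK of
            # this connection and all its other queued ACKs are erased.
            oldest = mine[0]
            self.queue[oldest] = (conn, ack)
            self.queue = [item for i, item in enumerate(self.queue)
                          if i == oldest or item[0] != conn]
        else:
            self.queue.append((conn, ack))

    def on_departure(self, now):
        if not self.queue:
            return None
        conn, ack = self.queue.pop(0)
        queued = {c for c, _ in self.queue}
        for c in list(self.rate):
            # TSW update; connections that did not send only decay (assumption).
            size = self.ack_size if c == conn else 0
            volume = self.rate[c] * self.window + size
            self.rate[c] = volume / (now - self.last[c] + self.window)
            self.last[c] = now
            if self.rate[c] < self.free_below and c not in queued:
                del self.rate[c], self.last[c]   # free the idle entry
        return conn, ack
```

Replacing the single FIFO list by one queue per connection served in turn would give the round robin variant discussed next.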

Keeping an entry per connection seems to be the only problem with our 
algorithm. We believe that, with the increase in processing speed, this is no 
longer a real obstacle. Also, as we will see later, we can stop our algorithm beyond 
a certain number of connections since it converges to static filtering with δ = 1. 
This happens when the fair bandwidth share of a connection becomes very small. 
Classical filtering can be applied in this case without the need to account for full 
utilization. 

Now, delaying the filtering of ACKs from a connection separately from the 
other connections while keeping the classic FIFO service does not result in a 
complete isolation. The accepted burst of ACKs from an unfiltered connection 
increases the RTT seen by other connections. A round robin (RR) scheduling is 
required for such isolation. ACKs from filtered connections will no longer need 
to wait after ACKs from an unfiltered one. 

We implement the different filtering algorithms we cited above into ns and 
we use the simulation scenario of Figure 10. N TCP Reno sources transmit short 
files of sizes chosen randomly with a uniform distribution between 10 kbytes and 
10 Mbytes to the same destination D. The propagation delays of access links are 
chosen randomly with a uniform distribution between 1 and 10 ms. ACKs from 
all the transfers return to the sources via a 100 kbps slow channel. A source S_i 
transmits a file to D, waits for a small random time, and then transmits another 
file. We take a large reverse buffer and we change the number of sources from 
1 to 20. For every N, we run the simulations for 1000 s and we calculate the 
average throughput during a file transfer. We plot in Figure 11 the performance 
as a function of N for five algorithms: no-filtering (δ = +∞), classical filtering 
(δ = 1), delayed filtering with all the connections grouped into one flow, and 
per-connection delayed filtering with FIFO and with RR scheduling. 

Fig. 10. Simulation scenario 

Fig. 11. Case of multiple connections and a large buffer 

Fig. 12. Case of multiple connections and a small buffer 

No-filtering gives the worst performance due to the long queue of ACKs in the 
reverse buffer. Classical filtering solves this problem but it is too aggressive, so it 
does not give the best performance, especially for a small number of connections. 
For a large number of connections, the bandwidth share of a connection becomes 
small and the performance of classical filtering is close to that of the best policy, 
per-connection filtering with RR scheduling. Single-flow delayed filtering is no other 
than static filtering beyond a small number of connections, and per-connection 
filtering with FIFO scheduling gives worse performance than RR scheduling due 
to the impact of unfiltered connections on the RTT of filtered ones. 

4.2 Case of a Small Buffer 

The question we ask here is what happens if we decide to queue an ACK and we 
find that the buffer is full. In fact, this is the open problem of buffer management, 
with the difference here that the flows we are managing are not responsive to 
drops, unlike TCP data flows. The other difference is that in our case, ACK dropping 
is preceded by ACK filtering which reduces the overload of ACKs on the reverse 
buffer. The buffer management policy is used only in the particular case where 
filtering is not enough to avoid the reverse buffer overflow. We can see the relation 
between filtering and dropping as two consecutive boxes. The first box which is 
the filtering box tries to eliminate the unnecessary information from the flow of 
ACKs. The filtered flow of ACKs is then sent into the second box which contains 
the reverse buffer with the appropriate drop policy. 

For classical filtering we use the normal drop tail policy. The buffer space is 
fairly shared between the different connections (one ACK per connection) and we 
do not have enough information to use another, more intelligent drop policy. The 
same drop tail policy is used in case of single-flow delayed filtering when ACKs 
are filtered. When ACKs are not filtered, we use the Longest Queue Drop policy 
described in [16]. The oldest ACK of the connection with the longest queue is 
dropped and the new ACK is queued at the end. Now, for per-connection delayed 
filtering, the drop procedure takes advantage of the information already available 
for filtering. The oldest ACK of the connection with the highest rate is dropped. 
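
The two victim-selection rules used when ACKs are not filtered can be sketched as below. This is an illustrative Python sketch with hypothetical function names; the buffer is modelled as a plain list of (connection, ACK) pairs, oldest first, and the `rate` dictionary stands for the per-connection averages kept for filtering.

```python
from collections import Counter


def _drop_oldest(queue, conn):
    """Remove the oldest queued ACK that belongs to connection `conn`."""
    for i, (c, _) in enumerate(queue):
        if c == conn:
            del queue[i]
            return


def drop_longest_queue(queue):
    """Longest Queue Drop: drop the oldest ACK of the connection with most queued ACKs."""
    counts = Counter(conn for conn, _ in queue)
    victim = max(counts, key=counts.get)
    _drop_oldest(queue, victim)
    return victim


def drop_highest_rate(queue, rate):
    """Drop the oldest ACK of the queued connection with the highest average rate."""
    queued = {conn for conn, _ in queue}
    victim = max(queued, key=lambda c: rate.get(c, 0.0))
    _drop_oldest(queue, victim)
    return victim


queue = [("a", 1), ("b", 1), ("a", 2), ("c", 1)]
drop_longest_queue(queue)                                   # removes ("a", 1)
drop_highest_rate(queue, {"a": 100.0, "b": 900.0, "c": 50.0})  # removes ("b", 1)
```

In both cases the new ACK would then be appended at the end of the queue, as described above.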

We repeat the same simulation of the previous section but now with a small 
reverse buffer of 10 packets. The average throughput is shown in Figure 12 as a 
function of the number of sources. In this case and especially at large N, the dif- 
ference in performance is mainly due to the difference in the drop policy. This can 
be seen from the difference in performance at large N between per-connection 
delayed filtering and classical filtering. If the drop policy is not important, these 
two filtering strategies should give similar performance. Per-connection delayed 
filtering with RR scheduling again gives the best performance. The relative 
position of classical filtering with respect to no-filtering is a little surprising. When 
the number of connections is smaller than the buffer size, classical filtering is able 
to keep an empty place for a new connection and thus to protect it from already 
running connections. This leads to better performance than in case of no-filtering. 
However, as the number of connections exceeds the buffer size, the reverse buffer 
starts to overflow and new connections will no longer be protected. Thus, the 
performance of classical filtering deteriorates when N increases, and it drops 
below that of no-filtering for large N. We conclude that when the number of 
connections is larger than the buffer size, a simple drop tail policy is enough for 
good performance. This again limits the number of connections that our delayed 
filtering algorithm needs to track. 

Abstract. The emergence of new multimedia applications over IP 
(videoconference, voice and fax over IP, etc.) produces a great demand 
for QoS in the IP environment. However, QoS is difficult to implement 
in a best-effort network such as the Internet. For IP QoS, the Internet 
Engineering Task Force (IETF) developed two models: the Integrated 
Services architecture (IntServ) and the Differentiated Services 
(DiffServ). The IntServ architecture uses RSVP as a protocol for 
signalling and allowing resource reservation. This paper presents the 
design, implementation and test of an RSVP agent based on a generic 
QoS API. The generic QoS API may support new mechanisms for 
providing QoS in the future other than RSVP. Many applications are 
not designed with QoS management in mind, and their source code is 
usually not available. Even if the source code is available, adapting an 
application for QoS management is not a straightforward task. 
Furthermore, it is not easy to identify the most suitable RSVP 
parameters for each profile. This agent allows both static and automatic 
reservations through an IP probe that matches the instantaneous resource 
requirements of the application to the reservation managed by RSVP. 
In order to validate the implementation and use of the RSVP agent, 
some tests were carried out with multimedia applications. These tests 
show some characteristics of the QoS-enabled network and the agent behaviour. 



1 Introduction 

One of the most important issues in the current Internet is the quality of service 
(QoS). Quality of Service could be defined as the ability of a network element to have 
some level of assurance that its traffic and service requirements can be satisfied. The 
emergence of new multimedia applications over IP (videoconference, voice and fax 
over IP, etc.) produces a great demand for QoS in the IP environment. However, QoS 
is difficult to implement in a best-effort network such as the Internet. Yet QoS is 
essential to run all the new applications and services that have more or less strict 
network requirements such as delay and bandwidth. QoS can be characterized by a 
small set of measurable parameters: 

- Service availability, reliability of the connection between the user and Internet. 

- Delay, or latency; refers to the interval between transmitting and receiving 
packets between two reference points. 

- Delay variation, or jitter; refers to the variation in time duration between packets 
in a stream following the same route. 

- Throughput, the rate at which packets are transmitted through the network. 

- Packet loss rate: the maximum rate at which packets can be discarded during 
transfer through a network, typically caused by congestion. 

Furthermore, IP QoS will allow differentiation among Internet Service Providers 
(ISPs) and produce new revenues, since a set of value-added services will appear 
alongside the legacy best-effort service. 

A number of QoS architectures have been defined by various organizations in the 
communications industries. For IP QoS, it seems that two models will be 
preponderant. The Internet Engineering Task Force (IETF) developed these two 
models: the Integrated Services architecture (also referred to as IntServ) and the 
Differentiated Services (also referred to as DiffServ). 

The IntServ model was defined in RFC 1633 [1], which proposed the Resource 
Reservation Protocol (RSVP) as a working protocol for signalling in the IntServ 
architecture [2]. This protocol assumes that resources are reserved for every flow 
requiring QoS at every router hop in the path between receiver and transmitter, using 
end-to-end signalling. Scalability is a key architectural concern, since IntServ requires 
end-to-end signalling and must maintain a per-flow soft state at every router along the 
path. A core network with thousands of flows will not support this burden. Other 
concerns are how to authorize and prioritize reservation requests and what happens 
when signalling is not deployed end-to-end. It seems that IntServ will be deployed at 
the edge of the network where user flows can be managed at the desktop user level. 
An important driver for IntServ will probably be Microsoft's implementation of 
RSVP in Windows 2000. 

The DiffServ has defined a more scalable architecture in order to provide IP QoS 
[3]. It has particular relevance to service providers and carrier networks (Internet 
core). DiffServ minimises signalling and concentrates on aggregated flows and per 
hop behaviour applied to a network-wide set of traffic classes. Flows are classified 
according to predetermined rules at the network edge, such that many application 
flows are aggregated to a limited set of class flows. Currently there are two per hop 
behaviours (PHBs) defined in draft specifications that represent two service levels (or 
traffic classes): assured forwarding (AF) and expedited forwarding (EF). One 
important issue of DiffServ is that it cannot provide an end-to-end QoS architecture 
by itself. Also, it is not a mature standard. 

The two models are not competing or mutually exclusive; on the contrary, they 
are complementary. 

1.1 RSVP: Current RSVP-Capable Applications 

This section presents the relevant RSVP-capable applications currently available that 
we used as reference in our agent implementation. A large and detailed 
list of commercial and academic RSVP implementations (router, host, applications 
and test tools) can be found in [4]. 

MBONE Tools 

Currently, VIC and VAT have ports to RSVP. The Information Sciences Institute [5] 
has developed this porting. In order to support RSVP, the Tcl/Tk and C++ codes have 
been modified. Both solutions have some limitations. In the case of VIC, only 
bandwidth is taken into account. VAT uses an interesting set of mappings of audio 
coding to Tspec [6]. 

RTAP 

RTAP is a tool that comes with the RSVP distribution of ISI, and interacts with the 
RSVP daemon. It allows making manual reservations and has a complete and 
powerful debug function. 

USMInT 

USMInT [7] is a set of multimedia applications (video, audio, etc.) and uses an 
interesting communications protocol between media agents called PMM. Its RSVP 
support is based on that of VIC, with similar characteristics. 



2 RSVP Agent Design Based on a Generic QoS API 

The goal of the RSVP agent implementation is to create a software module that can 
interact with any multimedia application and allow it to request a 
given QoS from the network. 

One important starting concern was the lack of access to the source code of most 
multimedia applications. This makes direct communication between the 
agent and the application infeasible. Such communication is necessary since the 
application activity normally changes continuously (size of video windows, frames per 
second, number of users, etc.), which translates into new network requirements. 



2.1 Generic QoS API Definition and Implementation 

The goal of the definition and utilisation of a generic QoS API is to isolate the 
application layer of our reservations agent from the network and lower layers, where 
the QoS mechanisms (signalling, admission control, scheduling and queuing) are 
implemented and have a high dependence on the technology. 

To define our generic QoS API we took as reference a proposal of the Quality of 
Service Internet2 Working Group [8]. Our QoS API definition (see Fig. 1) takes into 
account the possibility of using different QoS technologies such as RSVP, IEEE 802.1Q/p, 
LAN Emulation CoS, and so on. Currently we focus the work on RSVP technology 
and two basic platforms, Unix and Windows. For the Windows platform, we use 
Winsock 2.0 with its Generic QoS, but we have encountered significant problems due 
to its immaturity. 




Fig. 1. QoS API stack 

For the Unix platform (both Sun Solaris and Linux) we use the RAPI [9] and SCRAPI 
[10] RSVP interfaces from the ISI distribution [5]. The main reason to use SCRAPI is 
its simplicity of use, although we focus especially on RAPI. 

To date, our QoS API implementation has these three main functions: 



QoS_Bind (Destination, Profile, Role, Handler, Source, QoS_Authenticator) returns 
QoS_SID 

This function creates and binds the application to a QoS session. It is not dependent on 
the underlying network protocols. Currently, it only supports RSVP (RAPI and 
SCRAPI). For an application acting as an RSVP "sender", QoS_Bind would be 
responsible for sending the Tspec, Adspec, etc. to the network. For an application 
acting as an RSVP "receiver", QoS_Bind would be responsible for making the 
reservation request to the network. The Profile field contains network parameters 
translated from application parameters. Our agent implementation simply uses the IP 
probe for this mapping. There are three possible roles: talk (only sends), listen (only 
receives) and converse (both). The Handler locates a routine that handles 
asynchronous notification events related to the QoS session. QoS_SID is a structure 
that contains a unique QoS session identifier. So far, we do not plan to 
implement a Bandwidth Broker or the security treatment that appears in the Internet2 
proposal. For this reason QoS_Authenticator is not used. 



QoS_Rebind (QoS_SID, Profile, Role, QoS_Authenticator) 

This function allows the application to change the QoS specifications set originally by 
QoS_Bind. 



QoS_Unbind (QoS_SID, QoS_Authenticator) 

This function tears down the QoS session and releases all acquired QoS resources. 

2.2 RSVP Agent Implementation 

As previously mentioned, one of the goals of the RSVP agent is to enable QoS for 
applications without modifying their source code. Nonetheless, the use of the 
QoS API primitives in conjunction with the application is also possible. But other 
unsolved issues then arise, like mapping the generic application parameters (fps, video 
size, type of coding, etc.) to the specific QoS network layer parameters (Tspec, ToS, 
etc.). A lot of effort is dedicated to this task, trying to define profiles that map a given 
set of user parameters to a set of Tspec parameters. 
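
To illustrate how an application could use the three primitives of Section 2.1 directly, the following Python sketch models the session lifecycle only. It is a hypothetical stand-in, not the authors' API: the class, method and field names are assumptions, the profile is a plain dictionary, and no RSVP signalling (RAPI or SCRAPI) is performed.

```python
import itertools
from dataclasses import dataclass

ROLES = ("talk", "listen", "converse")  # only sends, only receives, both


@dataclass
class Session:
    sid: int
    destination: str
    profile: dict
    role: str
    handler: object = None   # routine for asynchronous notification events


class GenericQoS:
    """In-memory model of bind / rebind / unbind (no real signalling)."""

    def __init__(self):
        self._sessions = {}
        self._ids = itertools.count(1)

    def bind(self, destination, profile, role, handler=None):
        """Create a QoS session and return its identifier."""
        if role not in ROLES:
            raise ValueError("unknown role: %r" % (role,))
        sid = next(self._ids)
        self._sessions[sid] = Session(sid, destination, dict(profile), role, handler)
        return sid

    def rebind(self, sid, profile, role):
        """Change the QoS specification of an existing session."""
        if role not in ROLES:
            raise ValueError("unknown role: %r" % (role,))
        session = self._sessions[sid]
        session.profile, session.role = dict(profile), role

    def unbind(self, sid):
        """Tear down the session and release what it holds."""
        del self._sessions[sid]


api = GenericQoS()
sid = api.bind("receiver.example", {"token_rate": 64000}, "talk")
api.rebind(sid, {"token_rate": 128000}, "talk")
api.unbind(sid)
```

A real backend would translate each call into the signalling of the underlying technology; the point of the sketch is only that the application sees a technology-independent session handle.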

The basic idea to isolate the reservations agent from the applications is to insert an 
IP probe into the agent (see Fig. 2). The purpose of this IP probe is to monitor the 
network activity and assess the traffic characteristics the application is producing 
(mainly throughput). 




Fig. 2. RSVP Agent Implementation 

The IP probe is based on the tcpdump monitoring tool and a processing module. 
The IP probe has configurable parameters like sample duration, sample frequency and 
the minimum throughput variation threshold. It does not place a significant burden 
on the host. The main parameter obtained is the dynamic bandwidth (bw) used by the 
different streams of the application. This parameter is mapped into the Tspec 
using the SCRAPI heuristic [10], that is, token rate equal to bw and bucket depth and 
peak rate equal to 2*bw. New reservation messages are only sent if the measured 
throughput variation is greater than a given threshold. 
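
The mapping and the re-reservation rule described above can be sketched in a few lines of Python. The names are hypothetical, and the sketch assumes that the variation threshold is relative to the bandwidth currently reserved, which the text does not specify; the default of 10% is likewise only an example value.

```python
def tspec_from_bandwidth(bw):
    """Heuristic quoted above: token rate = bw, bucket depth = peak rate = 2*bw."""
    return {"token_rate": bw, "bucket_depth": 2 * bw, "peak_rate": 2 * bw}


class ReservationTracker:
    """Decides when the measured bandwidth warrants a new reservation message."""

    def __init__(self, threshold=0.1):
        self.threshold = threshold  # assumed: relative variation (10% here)
        self.reserved_bw = None     # bandwidth of the current reservation

    def update(self, measured_bw):
        """Return a new Tspec if a reservation should be sent, else None."""
        if self.reserved_bw is not None and self.reserved_bw > 0:
            variation = abs(measured_bw - self.reserved_bw) / self.reserved_bw
            if variation <= self.threshold:
                return None         # change too small: keep the reservation
        self.reserved_bw = measured_bw
        return tspec_from_bandwidth(measured_bw)


tracker = ReservationTracker(threshold=0.1)
tracker.update(1000)   # first sample: a reservation is made
tracker.update(1050)   # 5% variation: no new message
tracker.update(1200)   # 20% variation: reservation is updated
```

Keeping the last reserved value, rather than the last measured one, as the reference avoids a slow drift going unnoticed.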



3 Test and Results 

This section describes the tests carried out with the RSVP agent in the network. The 
goal of these tests was to evaluate the agent and the behaviour of a QoS-enabled 
network. 

3.1 Methodology 

The methodology followed to achieve the evaluation was divided into two main parts: 
network behaviour using static reservations and Agent behaviour using automatic 
reservations. 

As a basic methodology, we tried to make all the tests follow the recommendations 
(e.g. [11] and [12]) of two IETF groups focused on measurement and benchmarking: 
IP Performance Metrics (IPPM) and Benchmarking Methodology (BMWG). 



Network behaviour using static reservations 

VIC and VAT were used to generate well-known multimedia traffic (video and audio) 
and the agent performed static reservations. Moreover, some background traffic was 
generated with different patterns (packet size, throughput, etc.) in order to produce 
severe interference with the useful traffic (VIC and VAT). 

The purpose of these tests was to verify the success of reservations and to obtain an 
initial behaviour of the queuing techniques implemented in the router (Weighted Fair 
Queuing). The main parameter to measure was throughput (in bytes/sec and 
packets/sec). 

Agent behaviour using automatic reservations 

The ISABEL application [13] was used to generate video and the agent performed 
reservations and refresh messages. ISABEL is a customizable multiconferencing 
application which provides advanced video/audio/data conferencing features as well 
as CSCW (Computer Supported Collaborative Work) support. We chose this 
application to implement QoS because we use it in regular events, both at a local 
(Spanish) and at an international scope (for a detailed list, see [14]): seminar and 
conference broadcasting, project tele-meetings, Ph.D. and undergraduate courses, 
and so on. 

The purpose of these tests was to verify the success of the automatic refreshed 
reservations. Also, detailed queuing behaviour was measured. The main parameters to 
be measured were throughput (in bytes/sec and packets/sec), delay and jitter. 



3.2 Scenario and Measurement Tools 

The network where we carried out all the tests is located in the Advanced Broadband 
Communications Centre (CCABA), which belongs to the Universitat Politècnica de 
Catalunya. 

The network is depicted in Fig. 3. A Cisco 7206 router interconnects two 
Ethernets and an ATM LAN. The ATM LAN is composed of two ATM switches and 
directly attached workstations with STM-1 interfaces (155.52 Mbps). 

Some workstations were used to produce and receive multimedia traffic (WS 1 and 
WS 6), to generate background traffic or interference (WS 3) and to monitor the 
network (WS 2 and WS 5). These workstations were Sun machines running Solaris 
2.5.1 and PCs running Linux RedHat 6.0. 

Fig. 3. Test environment 

The applications used were the Sun versions of VIC and VAT and the Linux version 
of ISABEL, all in unicast mode, together with the suitable version of the RSVP agent. 
The background traffic was generated by the MGEN application. We could generate 
and control severe interference by locating the generator in the ATM LAN (WS 3). Two 
workstations measured the traffic evolution in both Ethernets, using the 
tcpdump application. The results of this monitoring (IP headers and time-stamps) 
were processed to obtain throughputs, delays, etc. 

The Cisco 7206 was running IOS 12.0, supporting RSVP and Weighted Fair 
Queuing (WFQ) as the algorithm used to schedule packets [15]. WFQ is essential for 
QoS support; other mechanisms like WRED were not used. WFQ was 
configured as follows: 

fair-queue congestive-discard-threshold dynamic-queues reservable-queues 

congestive-discard-threshold: number of messages allowed in each queue, in the 
range 1 to 512. When the number of messages in the queue reaches the specified 
threshold, new messages are discarded. 

dynamic-queues: number of dynamic queues used for best-effort conversations 
(we fixed it to 256). 

reservable-queues: number of reservable queues used for reserved conversations 
(we fixed it to 4). 

The first parameter is important since it affects the overall performance 
(bandwidth and delay). 



3.3 Results 

This section shows the main results obtained from the evaluation of the network and 
the RSVP agent. These results are divided into three main parts: 

Overall network behaviour 

These results show how interference can affect multimedia flows and how the use 
of RSVP improves performance. The RSVP agent was used 
manually with static reservations. In order to highlight different effects we tested the 
applications VIC and VAT through several steps: 

(a) No traffic and FIFO queuing in the router. 

(b) Background traffic composed of three streams with different patterns (packet 
size and pps). 

(d) VAT sends PCM audio at 64 Kbps. 

(e) VIC sends H.261 video at 3.1 Mbps (average). 

(f) Background traffic interferes with multimedia traffic. FIFO queuing. 

(g) Fair Queuing (FQ) is activated on the router. 

(h) RSVP is activated in Controlled Load service using Fixed Filter, reserving both 
video and audio flows. 

Several tests were carried out, in particular with different interference characteristics. 
For brevity, this document shows only one case. Nonetheless, some 
interesting behaviour can be observed in Figs. 4, 5 and 6. Interference produces 
significant losses in the video stream (Fig. 5, losses directly obtained from the VIC 
statistics), degrading the received video quality. 




Fig. 4. Network performance with VIC and VAT: Mbits/sec (steps shown: Interference, FIFO; 
VAT, FIFO; VIC, FIFO; VIC+VAT+Int., FIFO; VIC+VAT+Int., Fair Queuing; VIC+VAT+Int., RSVP) 

Fig. 5. Network performance with VIC and VAT: packets/sec 

Fair queuing gives preference to flows with small packet size and low bandwidth 
requirements. Hence the audio flow, which has these characteristics, is hardly 
affected by interference when FQ is activated (see step g in Fig. 6). 




Fig. 6. Network performance with VAT: Mbits/sec 

As a final conclusion, RSVP does its job and provides QoS to the identified streams. 
Also, mechanisms like fair queuing can produce unexpected behaviour in highly 
demanding flows with no reservations, as well as other effects like increased delay. 

Agent behaviour 

Similar tests were carried out with ISABEL as multimedia application and the RSVP 
agent acting automatically. The results show that the agent works correctly and how 
the reservations are refreshed and adapted to new traffic requirements. 

Fig. 7. Comparative ISABEL behaviour: (a) FIFO (b) FQ (c) RSVP agent 
(interference: 5.8 Mbps, 7500 pps of 100 bytes) 

Fig. 7 shows the different application behaviour in reception depending 
on the network environment in three different modes (also different network 
performance). The first case shows ISABEL when the router is configured with FIFO 
queues. In the next one, the router is configured with Fair Queuing. The interference 
adversely affects the video stream in both cases. The last case shows the 
application behaviour when the reservation agent is up. It adapts the reservation to the 
new network requirements. Note that the severe interference is reduced due to the FQ 
mechanism. 



Queuing effects 

This result is a first look at how the queuing discipline affects the traffic. 
Although the bandwidth is reserved, some undesired effects could be introduced, like 
increased delay or jitter. 

The application under study, ISABEL, sends the video streams in bursts (each burst 
corresponds to a video frame). Capturing a large sample of bursts, we calculated the 
time between consecutive packets of a burst, both at the source and at the destination. 
The average delay variation (jitter) is represented in Fig. 8 as a function of the queue 
and its configuration. Note that in the case of FIFO queuing there is no noticeable 
jitter, whereas when WFQ is used it is not negligible. 
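
The measurement described above, the average time between packets of the same burst, can be sketched as follows in Python. The function names and the burst-detection rule are assumptions: the sketch takes a sorted list of packet timestamps (for instance extracted from a capture) and starts a new burst whenever the gap to the previous packet exceeds a chosen limit.

```python
def split_bursts(timestamps, gap):
    """Group sorted timestamps into bursts; a spacing above `gap` starts a new burst."""
    bursts = []
    for t in timestamps:
        if bursts and t - bursts[-1][-1] <= gap:
            bursts[-1].append(t)
        else:
            bursts.append([t])
    return bursts


def mean_intra_burst_spacing(timestamps, gap):
    """Average time between consecutive packets that belong to the same burst."""
    spacings = [b - a
                for burst in split_bursts(timestamps, gap)
                for a, b in zip(burst, burst[1:])]
    return sum(spacings) / len(spacings) if spacings else 0.0


# Made-up timestamps (seconds) for two bursts of three packets each.
source = [0.000, 0.001, 0.002, 0.100, 0.101, 0.102]
destination = [0.010, 0.013, 0.016, 0.110, 0.113, 0.116]
at_source = mean_intra_burst_spacing(source, gap=0.02)
at_destination = mean_intra_burst_spacing(destination, gap=0.02)
```

Comparing the value at the destination with the value at the source shows how much the queuing discipline spreads the packets of a frame.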




Fig. 8. Queuing effects: Average time between packets of a burst 

Jitter affects video quality and, in extreme cases, produces degradation and losses.
Therefore, network nodes should be configured so as to minimise delay while
maintaining their capacity to provide end-to-end QoS to all flows that require it.
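
The burst measurement described above can be sketched as follows. This is only an
illustration under stated assumptions, not the tooling used for the experiments: the
function names, the 10 ms threshold used to separate bursts, and the sample timestamps
are all hypothetical.

```python
from statistics import mean

def intra_burst_gaps(timestamps, burst_gap=0.010):
    """Return the gaps (seconds) between consecutive packets of the same burst.

    Assumption: two packets belong to the same burst when they are less than
    `burst_gap` seconds apart; larger gaps separate video frames.
    """
    gaps = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        gap = cur - prev
        if gap < burst_gap:
            gaps.append(gap)
    return gaps

def mean_gap_change(source_ts, dest_ts, burst_gap=0.010):
    """Average intra-burst packet spacing at the destination minus that at the source."""
    src = intra_burst_gaps(source_ts, burst_gap)
    dst = intra_burst_gaps(dest_ts, burst_gap)
    return mean(dst) - mean(src)

# Hypothetical capture: two bursts of three packets each.
source = [0.000, 0.001, 0.002, 0.100, 0.101, 0.102]
dest = [0.050, 0.053, 0.056, 0.150, 0.153, 0.156]
change = mean_gap_change(source, dest)
```

A positive result means that the queue has spread the packets of a burst further apart
than they were when sent.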



4 Conclusions and Future Work 

The implementation and use of QoS-enabled applications will be one of the drivers of
new value-added services on the Internet. Therefore, we think it is essential to develop
and deploy QoS-enabled applications. But adapting an application for QoS
management is not a straightforward task. In particular, it is not easy to identify the
most suitable RSVP parameters for each profile. We think this agent contributes to this
task by allowing both static and automatic reservations. The use of an IP probe that
tracks the instantaneous resource requirements of the application addresses the issues
mentioned. Also, we base the agent on a generic QoS API that may in the future support
QoS mechanisms other than RSVP.

We have ongoing work on related topics. We are working on obtaining a
comprehensive set of measurements of traffic with QoS requirements. These
measurements include traditional ones, such as detailed throughput and delay effects
in bursts, and their relationship to application-layer parameters. We also plan to
characterize some multimedia sources in terms of the Tspec parameters.



References 

1. R. Braden, D. Clark, and S. Shenker. Integrated Services in the Internet
Architecture: an Overview. RFC 1633. June 1994.

2. J. Wroclawski. The Use of RSVP with Integrated Services. RFC 2210. 1997.

3. S. Blake, D. Black, M. Carlson, E. Davies, Z. Wang, W. Weiss. An Architecture
for Differentiated Services. RFC 2475. December 1998.

4. G. Gaines and M. Festa. RSVP-QoS Implementation Survey Version 2.
http://www.iit.nrc.ca/IETF/RSVP_survey/. July 1998.

5. USC Information Sciences Institute (ISI) Web Page: http://www.isi.edu/rsvp/

6. S. Shenker, J. Wroclawski. General Characterization Parameters for Integrated
Service Network Elements. RFC 2215. September 1997.

7. GMD FOKUS. USMInT Universal Scalable Multimedia in the Internet.
http://www.fokus.gmd.de/research/cc/glone/projects/usmint, 1998.

8. B. Riddle and A. Adamson. A Quality of Service API Proposal. Internet2
Technical Paper, May 1998.

9. R. Braden and D. Hoffman. RAPI: An RSVP Application Programming
Interface. Version 5. Internet draft, draft-ietf-rsvp-rapi-00, 1997.

10. B. Lindell. SCRAPI: A Simple "Bare Bones" API for RSVP. Internet-Draft,
draft-lindell-rsvp-scrapi-02.txt. February 1999.

11. S. Bradner, J. McQuaid. Benchmarking Methodology for Network Interconnect
Devices. RFC 2544. Obsoletes RFC 1944. March 1999.

12. V. Paxson, G. Almes, J. Mahdavi, M. Mathis. Framework for IP Performance
Metrics. RFC 2330. May 1998.

13. ISABEL Web Page: isabel.dit.upm.es

14. CCABA list of activities: http://www.ccaba.upc.es/activities/E_activities.html

15. Cisco Systems Inc. Configuration Guides and Command References.
http://www.cisco.com/univercd/cc/td/doc/product/software/ios120/12cgcr/index.htm, 1999.




On the Feasibility of RSVP as General Signalling Interface 



Martin Karsten (1), Jens Schmitt (1), Nicole Berier (1), and Ralf Steinmetz (1, 2)

1: Darmstadt University of Technology, Merckstr. 25, 64283 Darmstadt, Germany
2: GMD IPSI, Dolivostr. 15, 64293 Darmstadt, Germany

{Martin.Karsten, Jens.Schmitt, Nicole.Berier, Ralf.Steinmetz}@KOM.tu-darmstadt.de
http://www.kom.e-technik.tu-darmstadt.de/

Abstract. There is much debate about whether explicit signalling is eventually required
to create a reliable and integrated multi-service Internet. If it is, further disagreement
exists about how such signalling has to be carried out. In this paper, we adopt the
point of view that signalling of Quality of Service (QoS) requests must not be
abandoned, given the high level of uncertainty about the future traffic mix in an
integrated communication network. We present a flexible architecture, based on
an extended version of RSVP, for signalling QoS requests. We approach the
question of RSVP's suitability for this purpose from two directions. First, we
present the design of a QoS signalling architecture describing flexible, yet efficient
interfaces between participating entities. Second, we report practical experience
from our ongoing effort to implement key components of this architecture.

1 Introduction 

The invention of RSVP [1] and the Integrated Services (IntServ) architecture [2] has
created significant expectations about the migration of the Internet towards an integrated
multi-service network. Subsequently, objections to the resulting signalling and
data forwarding complexity led to the establishment of a new working area, called
Differentiated Services (DiffServ) [3], in which much simpler solutions are sought.
However, recent results [4,5,6] have shown that, when only static service level
agreements (SLAs) are installed, the theoretical worst-case performance guarantees for
providing per-flow services might conflict more strongly than often assumed with the
objective of utilizing resources as efficiently as possible. We conclude that building
end-to-end services out of DiffServ Per-Hop-Behaviour (PHB) forwarding classes will not
be fully sufficient to satisfy the diverse end-to-end requirements of a future Internet.
Instead, we favour a combination of signalling service requests with a variety of
topological scopes. On the other hand, we also question the usefulness of precipitous
standardization of new signalling mechanisms before the full potential of existing (yet
maybe extended) proposals has been investigated and exploited.

In this paper, we try to show how stringent decoupling of service interfaces from service
creation (as, e.g., initially intended for RSVP and IntServ) can create a new point of view
on service signalling. The main goal of our work is to design and realize a flexible QoS
signalling architecture, which is composed of a few basic building blocks. At the
same time, we try to adhere to existing standardization proposals as much as possible.
This work is intended to be aligned with the recent IAB draft on QoS for IP [7].



*. This work is partially funded by the European Commission, 5th FW, IST, Project M3I (11429).



J. Crowcroft, J. Roberts, and M. Smirnov (Eds.): QofIS 2000, LNCS 1922, pp. 105-116, 2000.
© Springer-Verlag Berlin Heidelberg 2000




The paper is organized as follows. We present an overall signalling architecture in
Section 2 as well as certain extensions to the current RSVP specification in Section 3.
In Section 4, we present a simple use case analysis to demonstrate the flexibility of our
proposed architecture. Afterwards, in Section 5, we present experiences and quantitative
results of our RSVP implementation to illustrate the point of view that, despite its
theoretical complexity, RSVP is not as inefficient as often assumed. We relate our work
to other approaches in Section 6, as far as possible, and conclude this paper in Section 7
with a summary and an outlook on future work items.

2 Proposed Architecture 

One must clearly distinguish two roles of a signalling protocol such as RSVP. It was
initially designed as a distributed algorithm to enable multiple entities to cooperatively
deliver a certain service, i.e., multiple routers creating a reservation-based,
end-to-end transmission service. On the other hand, it can be considered as an interface
specification to request services, regardless of how the service is technically constructed.
The most important requirement when assessing the basic architectural alternatives
is to regard interfaces (especially interfaces to end-users) as stable and hard
to change. Therefore, service interfaces must be chosen carefully to be very flexible,
robust and compatible with future developments. On the other hand, a service interface
must not inhibit the efficient realization of services. The best way to accommodate
these goals is to make interfaces as lean, yet as expressive, as possible.

2.1 Concept 

Our proposal for an overall QoS signalling architecture conceptually consists of three
layers as depicted in Figure 1. It is assumed that a basic connectivity mechanism exists,
which is given by a routing protocol and packet forwarding nodes called routers. This is
described as the packet layer in the picture. The actual QoS technology is represented by
an intermediate QoS layer. An entity that, besides carrying out router functionality, also
performs packet-based load management by policing, shaping, scheduling, or marking
packets for a certain scheduling objective is called a QoS enabler. A pure QoS enabler,
however, does not participate in end-to-end signalling. Advanced end-to-end services
that allow performance characteristics to be specified dynamically are realized using a
complementary interface on the service layer. The entities of this layer, which handle
service signalling and potentially flow-based load control (admission control), are denoted
as service enablers. A service enabler can also perform the role of a QoS enabler. Of
course, in a future QoS-enabled Internet, further open issues, such as QoS routing, have
to be addressed as well. However, their eventual precise definition is currently beyond the
scope of a QoS signalling architecture.

The focus of this work is to flexibly realize a service layer that allows a variety of
QoS layers to be integrated. In the conceptual architecture, the layers can be considered
as roles. In contrast to previous work, the role or functionality of each layer is not bound
to certain nodes in the network topology. Instead, which node carries out the role of a
certain layer depends on a network operator's particular choice of QoS technology and,
furthermore, on the service class. Detailed use cases are presented in Section 4.

Figure 1: QoS Signalling Architecture - Conceptual View



2.2 Topological View 

When considering the topological view of this signalling architecture, intermediate
nodes have to be divided into edge routers and interior routers. Service signalling
takes place at least between edge routers. Depending on the service class and the
particular QoS technology, intermediate routers might participate in the signalling as
well. Furthermore, subnets might employ bandwidth brokers to carry out resource
allocation for the complete subnet for certain service classes. In this case, service
requests can be forwarded from edge routers to the bandwidth broker. All nodes are
classified as either service-aware, partially service-aware or service-unaware, as
depicted in Table 1. Note that the term service-unaware only denotes that a node does not
participate in service signalling. It might nevertheless carry out the role of a QoS
enabler and thus perform packet-based QoS-enabling mechanisms. Partially service-aware
nodes have to distinguish whether to process or just forward a service request. The main
criterion for this distinction is very likely to be the service class. This is further
discussed in Section 4.

Table 1: Service Awareness of Network Nodes

Service Awareness          Description
service-aware              RSVP-capable, support for all service classes
partially service-aware    RSVP-capable, support for some service classes
service-unaware            not RSVP-capable
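
The distinction in Table 1 can be illustrated with a small sketch. It assumes, as the
text suggests, that the service class is the only criterion a partially service-aware
node uses; the names `Awareness` and `handle_request` and the service class labels are
hypothetical and not part of RSVP.

```python
from enum import Enum

class Awareness(Enum):
    SERVICE_AWARE = "service-aware"
    PARTIALLY_AWARE = "partially service-aware"
    SERVICE_UNAWARE = "service-unaware"

def handle_request(awareness, supported_classes, service_class):
    """Decide whether a node processes a service request or merely forwards it."""
    if awareness is Awareness.SERVICE_UNAWARE:
        # Not RSVP-capable: the message is forwarded like any other packet.
        return "forward"
    if awareness is Awareness.SERVICE_AWARE:
        # Supports all service classes.
        return "process"
    # Partially service-aware: process only the classes this node supports.
    return "process" if service_class in supported_classes else "forward"

# Hypothetical interior router that only handles an aggregate trunk class.
decision = handle_request(Awareness.PARTIALLY_AWARE, {"trunk"}, "per-flow")
```

In this sketch a forwarding node may still act as a QoS enabler on the data path; the
function only models participation in service signalling.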



2.3 RSVP as General Mechanism 

When realizing a service layer, a uniform extended RSVP interface for advanced services
satisfies both goals: flexibility and optimization for highly demanding services. Using
such an interface as the service layer entity at each traffic exchange is both sufficient
and effective to realize the conceptual architecture for multiple topological and QoS
technology alternatives and to create meaningful end-to-end services. This design
represents the choice to carry on with the Internet service architecture and employ RSVP
(including the extensions presented in Section 3) as the primary signalling mechanism,
especially for inter-domain signalling. Initially, it can then be used as a service
interface between bandwidth brokers (particularly for dynamic DiffServ SLAs).

However, the main motivation is that a future migration towards employing RSVP in its
initially intended style, as a distributed algorithm to request and provide per-flow and
potentially per-node service guarantees, will be eased if the basic mechanisms are
already in place. In such a future scenario, RSVP then acts as a signalling mechanism
between each pair of nodes as well. Consequently, it is intended that a router
employing this extended version of RSVP can efficiently handle both per-flow and
aggregated service invocations of multiple service classes. The alternatives, inventing
different signalling mechanisms for per-flow and aggregated service requests or different
mechanisms for end-to-end and backbone signalling, seem clearly inferior, especially if
RSVP can be applied beyond its initial scope without introducing a large overhead.

3 RSVP Extensions 

There are two main shortcomings in the currently specified version of RSVP, which
hamper its application as a general service interface:

• Traffic flows are identified either by host addresses or by multicast addresses, i.e.,
the specification of subnets as source or destination address is not possible.

• Path state information has to be stored for each service advertisement in order to
ensure correct reverse routing of service requests.

In order to extend RSVP's functionality appropriately, existing ideas [8,9] have been
taken up for this work and augmented to design a general processing engine for a lean
and flexible service interface. The major goal is to achieve high expressiveness for
service interfaces. The extensions are mainly dedicated to, but not restricted to, unicast
communication (including communication between subnets) and cover cases where the
per-flow model of traditional RSVP signalling, which eventually exhibits quadratic
state complexity [9], seems inefficient because the requested transmission performance
characteristics do not require flow isolation at each intermediate node. In that sense, the
extensions are targeted at aggregated service requests on the control path. This has to be
distinguished from the issue of aggregating flows on the data path. For the latter, careful
network and traffic engineering, e.g. using MPLS [10], is required or, alternatively, strict
performance guarantees might be given by applying network calculus to multiple flows
[11]. For both multicast in general and non-aggregated performance-sensitive (i.e.
inelastic) unicast communication, the current version of RSVP can be considered very
well suited, especially if recent proposals to increase the overall efficiency of RSVP
operation [12] are realized. Note that the following extensions can be implemented
without increasing the complexity of an RSVP engine. However, they do extend the
application scenarios to cover a variety of new alternatives. This is demonstrated by a
use case analysis in Section 4.

3.1 Compound Prefix Addressing 

The current specification of RSVP supports only host and multicast addresses. In order
to specify service requests for traffic aggregates between subnets, the notion of addresses
has to be extended to cover CIDR prefixes for network addresses. A respective proposal
has been made in [8]. In the following, the term generalized address is used to refer
to either an end-system's address or a network's address expressed as a CIDR prefix,
extended by class A network addresses and the special address prefix 0.0.0.0/0 denoting
complete wildcarding. Additionally, it might be necessary to specify several such
addresses within a single session or sender description; thus the notion of a compound
address is introduced, which consists of a set of generalized addresses. Of course, a
dedicated node must exist within an end-subnet to receive and respond to such service
requests. In principle, any node can emit such requests as long as it is authorized.
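
The notions of generalized and compound addresses can be sketched with Python's standard
`ipaddress` module. This is an illustrative model only, not the encoding used in RSVP
objects; the class name `CompoundAddress` and the example prefixes are assumptions.

```python
from ipaddress import ip_address, ip_network

class CompoundAddress:
    """A set of generalized addresses (host addresses or CIDR prefixes)."""

    def __init__(self, *addresses):
        # A plain host address such as "192.0.2.7" becomes a /32 network,
        # so hosts and prefixes are handled uniformly.
        self.members = frozenset(ip_network(a) for a in addresses)

    def covers(self, host):
        """True if the host falls under any member of the compound address."""
        h = ip_address(host)
        return any(h in net for net in self.members)

# Hypothetical session description: one subnet plus one individual end-system.
session = CompoundAddress("10.1.0.0/16", "192.0.2.7")
wildcard = CompoundAddress("0.0.0.0/0")  # complete wildcarding
```

For example, `session.covers("10.1.5.9")` holds because the host lies inside the first
prefix, whereas `session.covers("192.0.2.8")` does not.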




In order to exploit the full flexibility of compound addresses, a further generalization
is needed to specify their handling at certain nodes. During the transmission of RSVP
messages targeted to a compound address, the border router towards the specified
subnet(s) will be reached. At that point, it has to be decided whether the message is
forwarded towards multiple destinations or not. If the message is not forwarded, then the
resulting service essentially covers only a portion of the end-to-end path. If, however,
the message is forwarded into multiple subnets, it is not immediately clear how to
interpret any quantitative expression of performance characteristics. The term scoping
style is used to describe the alternatives: such a message is either forwarded to multiple
next hops (open scope) or not (closed scope). It is an open issue whether the scoping
style should be chosen by the node issuing a request or whether it is determined by the
network provider depending on its local policy on how to provide certain services. As this
is a matter of strategy and not mechanism, it is beyond the scope of this work to
investigate this question extensively. Nevertheless, some use case examples are given in
Section 4. In Figure 2, an example RESV message is shown to illustrate the choice between
both alternatives.

Figure 2: Compound Addresses and Scoping Style



If RSVP's addressing scheme is extended to include compound addresses, new challenges
are presented to the data forwarding engine of a router. In order to support flows
targeted to or sent from an end-system at the same time as a session involving the subnet
of this end-system, a longest-prefix match on both destination and source address might
be necessary to distinguish which packets belong to which session. However, it can be
expected that any service establishing efficient communication for traffic aggregates
between subnets is going to be built using a packet marking scheme, such as the DiffServ
model. In the DiffServ architecture, such a case is already considered and alleviated by
the fact that only edge routers are expected to do the full classification to isolate
aggregate service contracts from individual flows. In the core of the network, traffic
belonging to aggregates is forwarded according to its DiffServ marking, and individual
flows requiring total isolation can be appropriately serviced using a dedicated DiffServ
mark and full packet classification. The same marking scheme can be applied to RSVP
messages themselves, such that per-flow request messages are transmitted to the appropriate
end-subnet, but not processed by nodes along a trunk flow. This allows for transparent
end-to-end signalling, even in case of intermediate flow mapping.
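
The classification problem mentioned above, a longest-prefix match on both source and
destination address, can be sketched as follows. The session table, the function name
`classify` and the tie-breaking rule (destination prefix length first, then source
prefix length) are assumptions made for illustration; a real forwarding engine would use
a more efficient data structure than a linear scan.

```python
from ipaddress import ip_address, ip_network

def classify(sessions, src, dst):
    """Return the name of the most specific session matching a packet, or None.

    sessions: iterable of (name, source_prefix, destination_prefix).
    """
    s, d = ip_address(src), ip_address(dst)
    best_name, best_key = None, (-1, -1)
    for name, src_prefix, dst_prefix in sessions:
        src_net, dst_net = ip_network(src_prefix), ip_network(dst_prefix)
        if s in src_net and d in dst_net:
            # Longer prefixes are more specific; compare destination first.
            key = (dst_net.prefixlen, src_net.prefixlen)
            if key > best_key:
                best_name, best_key = name, key
    return best_name

# Hypothetical table: a subnet-to-subnet trunk and one isolated host-to-host flow.
table = [
    ("trunk", "10.1.0.0/16", "10.2.0.0/16"),
    ("flow", "10.1.0.5/32", "10.2.0.9/32"),
]
```

With this table, a packet from 10.1.0.5 to 10.2.0.9 is assigned to the individual flow,
while any other packet between the two subnets falls into the trunk session.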

A somewhat different treatment of port numbers is necessary to incorporate compound
addresses into RSVP. It might be useful to specify a port number if, e.g., the resulting
service is used for a single application which can be identified through the port number.
In any other case, the port number should be set to zero, which effectively denotes
wildcarding. Analogous to the description in the previous paragraph, a classification
challenge exists, which will be alleviated by employing a DiffServ-like marking scheme.

A scheme of compound addresses in combination with the choice of scoping style is
more appropriate for service requests between subnets than the initial approach to CIDR
addressing of RSVP messages [8], because it overcomes the limitations induced by
restricting source and destination to a single address prefix each. Furthermore, the
scoping style provides a controllable way to deal with the resulting flexibility. It is
therefore especially well suited to provide a signalling mechanism and interface between
bandwidth brokers which control the establishment of SLAs that are eventually provided to
traffic aggregates by means of DiffServ code points.

3.2 Hop Stacking 

To reduce the quadratic amount of state that has to be kept by routers in case of
traditional RSVP signalling, it is fairly simple to extend its specification in a way
similar to [9]. Usually, PATH messages are sent along the same path as the data flow, and
state containing reverse routing information is kept at each node to allow forwarding of
a RESV message along the reverse path towards the sender. In order to alleviate this
effect for intermediate nodes, a mechanism termed hop stacking can be incorporated into
RSVP. Each router has the option to replace the RSVP_HOP object in PATH messages by its
own address and store appropriate state information (traditional operation).
Alternatively, the address of the outgoing interface is stored as an additional RSVP_HOP
object in front of existing ones. During the service request phase, the full stack of such
hop addresses is incorporated into RESV messages and used at the respective nodes to
forward the service request to previous hops, if no PATH state has been stored. On the way
upstream, such a node removes its RSVP_HOP object and forwards the message to the next
address found in the stack. This mechanism allows state information for service requests
to be installed without the necessity to keep PATH state for each service announcement.
This specification introduces even further flexibility as compared to other approaches in
that stacking of hop addresses is optional and can be mixed with traditional processing
within a single session. A node might even remove the full stack, store it locally together
with the PATH state, and insert it into upstream RESV messages, such that the next
downstream node does not have to deal with hop stacking at all. Figure 3 illustrates the
flexibility of hop stacking. In this picture, nodes C and D perform hop stacking instead
of storing local state, whereas node E removes the full stack and stores it locally, such
that the next downstream node does not realize the existence of stacked hops at all. A
corresponding RESV message travelling along the reverse path can find its way back to the
sender by local state or stacked hop information.
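
The interplay of stored PATH state and stacked hops described above can be traced with a
small simulation. It is a minimal sketch under simplifying assumptions: messages are
modelled as Python lists of node names rather than RSVP_HOP objects, the route A to F and
the function names are hypothetical, and refreshes, errors and interfaces are ignored.

```python
def send_path(route, stacking_nodes):
    """Propagate a PATH message from route[0] (sender) to route[-1] (receiver).

    Returns the hop stack seen by the receiver and the PATH state stored per node.
    """
    hops = [route[0]]          # the sender inserts its own address
    path_state = {}
    for node in route[1:-1]:
        if node in stacking_nodes:
            # Hop stacking: put own address in front, keep no local state.
            hops = [node] + hops
        else:
            # Traditional operation (or full stack removal): store whatever hop
            # information arrived and replace it by the node's own address.
            path_state[node] = hops
            hops = [node]
    return hops, path_state

def send_resv(receiver, hops, path_state):
    """Return the sequence of nodes a RESV message visits on its way upstream."""
    visited = [receiver]
    stack = list(hops)
    while stack:
        node = stack[0]
        visited.append(node)
        if node in path_state:
            stack = list(path_state[node])   # continue with locally stored hops
        else:
            stack = stack[1:]                # remove own entry, follow the stack
    return visited

route = ["A", "B", "C", "D", "E", "F"]
hops_at_receiver, state = send_path(route, stacking_nodes={"C", "D"})
resv_route = send_resv("F", hops_at_receiver, state)
```

In this run only B and E keep PATH state, the receiver F sees a single previous hop (E),
and the RESV message still retraces the full path back to A.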

From a node's point of view, hop stacking provides a transparent method to employ other
approaches for QoS provision without per-flow state at intermediate nodes, e.g.,
RSVP over DiffServ-capable networks [13]. However, from an overall system's point
of view, hop stacking defines a generic mechanism to carry out RSVP signalling without
PATH state at each node. It can be used for trunk signalling and tunnelling, and provides
for an open interaction with traffic and network engineering. In that sense, slightly more
freedom is taken to extend the existing RSVP specification than in other approaches.

Figure 3: Hop Stacking for RSVP Messages

3.3 Interface Semantics 

While the extensions presented above form the procedural part of this proposal, it is
important to define coherent semantics at a service interface. The inherent meaning of
accepting a traditional RSVP message is to appropriately process and forward the request,
establishing an end-to-end resource reservation. In our architecture, the semantics are
changed such that the meaning of accepting a service request is a (legal) commitment to
deliver this service, regardless of its actual realization. For example, compound
addressing provides an interface to transparently incorporate IP tunnels as presented in
[14]. Similarly to the notion of edge pricing [15], this creates a notion of edge
responsibility for the end-to-end service invocation. Effectively, an application's data
flow might be mapped onto several consecutive network flows in the sense of traditional
RSVP. Intermediate nodes carrying out that mapping might therefore be considered
"RSVP gateways" or "service gateways".

4 Use Case Analysis 

In this section, a collection of use cases is described to show conceptually the flexibility
of the RSVP-based signalling architecture to integrate diverse QoS technologies and
create a variety of service scenarios besides RSVP's originally intended use for IntServ.
The use cases focus on the mechanisms of service layer signalling between service enablers.
In case of multiple alternatives, it is left to future work to determine the optimal
strategies to map service requests onto the underlying QoS technology.

4.1 Supporting Diverse Subnets 

In the following, it is briefly presented how various QoS subnet technologies can be
integrated by this QoS signalling architecture. There has been a lot of work to support
diverse link-layer mechanisms, DiffServ clouds and ATM subnets. Most of these are
well-known and treated (together with link layer technologies) by the IETF ISSLL working
group (see [16] for a list of documents) and covered by a number of other publications
as well. However, there is an additional scenario, explained below.

Service Signalling across ECN-priced Subnet. A somewhat speculative proposal to
provide QoS has been made in [17]. It is based on intermediate nodes carrying out
statistical ECN-marking, where the marks are interpreted as small charges at edge systems.
It is claimed that the resulting economic system provides a stable resource allocation
which then could be considered to resemble a certain QoS. In order to mimic the
all-or-nothing characteristic of regular admission control, the ingress of the subnet acts
like a risk broker and decides whether to accept or reject a service invocation. This risk
broker subsequently undertakes the economic risk of guaranteeing the accepted service even
in the presence of rising congestion and thus, charges. Another option is for the ingress
node to adapt the sending rate to the current congestion situation. Since the ECN mechanism
is an end-to-end mechanism and usually requires a transport protocol to carry the
feedback from the receiver to the sender, it is not immediately obvious how such an
approach should be realized for a partial path in the network. However, if RSVP signalling
is employed between the end nodes of such a partial path, the periodic exchange of
RSVP messages can be used by the egress node to provide at least some kind of feedback
to the ingress node.

4.2 Flexible Service Signalling Techniques 

The following scenarios present a variety of service invocations that can be supported 
using the RSVP-based QoS signalling architecture. Note that all the scenarios presented 
below can be carried out at the same time in the same infrastructure. 

Reduced State Service Signalling in Backbone Networks. In this scenario, a backbone
network is assumed which allows trunk reservations to be established between edge
nodes, and these reservations are dynamic in size and routing path. Because of a
potentially large number of edge nodes that advertise services to each other, it may be
inappropriate to keep state for each pair of edge nodes at routers. Furthermore, the
service class does not provide precise service guarantees, but rather loosely defined
bandwidth objectives. RSVP signalling can be carried out between each pair of nodes
including the hop stacking extension. Path state is not stored at intermediate nodes and
reservations towards a common sender are aggregated at each node. Consequently, the
worst-case amount of state that has to be kept at each router is linear in the number of
nodes, instead of quadratic. This example resembles the basic state reduction technique of
BGRP [9].
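
The difference between quadratic and linear worst-case state can be made concrete with a
small count. The sketch assumes the worst case in which a single router lies on the path
between every ordered pair of edge nodes; the function name and the node labels are
hypothetical.

```python
from itertools import permutations

def worst_case_state(edge_nodes):
    """Count state entries at a router that lies on the path of every edge pair.

    Returns (per_pair, per_sender): entries when state is kept for each ordered
    (sender, receiver) pair, and entries when reservations towards a common
    sender are aggregated into one.
    """
    per_pair = set(permutations(edge_nodes, 2))
    per_sender = {sender for sender, _ in per_pair}
    return len(per_pair), len(per_sender)

edges = [f"E{i}" for i in range(1, 101)]   # 100 hypothetical edge nodes
pair_entries, sender_entries = worst_case_state(edges)
```

For N edge nodes the first figure grows as N(N-1) and the second as N, which is the
reduction the text attributes to aggregation combined with hop stacking.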

Service Mapping of Flow Service to Trunk Service. The notion of compound prefix
addresses allows service mappings of individual flows into aggregated trunk services to
be expressed. Individual flow requests that arrive at the ingress end of the trunk service
are incorporated into a single service request, which is described by a compound prefix
address and transmitted to the other end of the trunk. Section 3.1 discusses how
to distinguish trunk traffic from other packets which might be exchanged between the
corresponding end systems. Alternatively, a tunnel might be established for the
aggregation part of the data path [14], with eligible packets encapsulated into the tunnel.
Nevertheless, it is useful to have a notion to describe the aggregate traffic flow, such
that signalling can be carried out across multiple autonomous systems.

Lightweight Service Signalling. One might even go one step further and consider an
RSVP PATH message as a service request, while RESV messages only confirm the
currently available resources. In that case, the end-systems keep track of the network
state along the data path and no state information is stored at intermediate nodes. Such a
scenario can be realized by a specific service class instructing each intermediate node to
report its current load situation and service commitments, but without carrying out any
particular activity for this request. PATH messages record their way through the
network by hop stacking, and RESV messages are initiated by receivers and include the
amount of service that the receiver requests. On its way back to the sender, the RESV
message is used to collect the information whether this service is currently possible.
Intermediate nodes are free to store as much state information as needed and feasible to
report best-effort estimates of the current load situation.
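
The lightweight scenario can be sketched as a probe that keeps no state at intermediate
nodes. This is an illustration under assumptions: each node is reduced to a single number
for its currently available capacity, the RESV message is assumed to carry the minimum
seen so far, and the function name and figures are hypothetical.

```python
def lightweight_probe(path, requested):
    """Simulate a PATH/RESV exchange that only collects load information.

    path: list of (node, available_capacity) from the sender side to the receiver side.
    Returns (feasible, grantable), where grantable is the largest amount every
    node on the path currently reports as available, capped at the request.
    """
    hops = [node for node, _ in path]      # PATH records its route by hop stacking
    available = dict(path)
    grantable = requested                  # the receiver starts the RESV with its request
    for node in reversed(hops):            # RESV follows the stacked hops upstream
        grantable = min(grantable, available[node])
    return grantable >= requested, grantable

# Hypothetical path with three intermediate nodes (capacities in Mbps).
nodes = [("B", 10.0), ("C", 4.0), ("D", 8.0)]
small_request = lightweight_probe(nodes, 3.0)
large_request = lightweight_probe(nodes, 5.0)
```

Since nothing is reserved, the result is only a best-effort estimate of the load at the
time the messages passed; the end-systems have to repeat the exchange to track changes.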

4.3 Application Scenarios 

In addition to the simple techniques described in the previous section, the following ex- 
amples describe more complete application scenarios which employ these techniques. 

Service Signalling for Dynamic Virtual Private Networks. Consider a corporate
Internet user wishing to establish a virtual private network (VPN) between multiple
locations. Each of these locations operates an IP network with a different subnet address
prefix. Furthermore, it is deemed important to dynamically adapt the requested VPN
capacity according to each location's current demand. In this example, it is examined how
the resulting service requests are handled by a backbone network B, which is crossed by
traffic from multiple locations. The scenario is illustrated in Figure 4. The corporate
subnets are denoted by S1, S2, S3 and S4. The edge routers are depicted as E1, E2 and
E3. Each corporate subnet emits service advertisements (e.g. from a bandwidth broker
or dedicated gateway) towards the other subnets, either separately or bundled with a
compound destination address. The corresponding service requests might be treated
separately or also be aggregated at certain nodes and targeted towards a compound
sender address.

As an example, S1 advertises a certain total amount of traffic towards the other
subnets, hence there is no specific description for each subnet. The advertisement is
processed by E1 and forwarded to the other edge devices. If the backbone QoS technology
is given by a combination of static SLAs and a bandwidth broker, E1 obtains the
information about multiple egress edge devices from the bandwidth broker and splits up
the request accordingly. If intermediate nodes also act as service enablers, the
advertisement is forwarded as a bundle until it reaches an intermediate node that holds
separate routing entries for the different destination subnets. This is similar to
multicast distribution and applies the service mapping technique described in the
previous section. The corresponding service requests from S2, S3 and S4 travel back to
S1, establishing the subnet-to-subnet service. Because of the dynamic nature of RSVP
signalling, the dimensioning of the VPN service can be adapted over time.




Figure 4: Virtual Private Network Scenario

Inter-Domain Service Signalling. A scenario of inter-domain trunk reservation
signalling has been described and carefully analysed in [9]. The same advantages as
reported for BGRP can be obtained by employing the reduced state signalling technique
described in the previous section. If combined with a recent proposal to bundle and
reliably transmit refresh messages [12], RSVP provides a functionally equivalent
solution having the same complexity as described there, without requiring a completely
new protocol.

5 Experiences from Implementing RSVP 

As a main building block for our architecture we have realized a new implementation
of RSVP, which is designed to express RSVP message processing concepts clearly in the
code and to be highly flexible and extensible. Furthermore, we have used an
object-oriented design and implementation, e.g., to separate container implementations
from the rest of the code. This approach allows us to experiment with different data
structures and algorithms for those containers that can become large and/or crucial
for efficient execution. Details of the implementation are described in [18] and [19].
The full source code can be downloaded at http://www.kom.e-technik.tu-darmstadt.de/rsvp/.

We have done some initial performance evaluations, which we consider quite promising 
with respect to RSVP’s ability to deal with a large number of flows. The implementation 
has not been subject to detailed code-level optimization, so far. However, on a FreeBSD 
workstation, equipped with a single Pentium III 450 MHz processor, our implementa- 
tion is able to handle the signalling for at least 50,000 unicast flows under almost real- 
istic conditions (see [20] for details). From these numbers, we deduce that the applica- 
bility of RSVP as a general purpose signalling interface and protocol to handle both ag- 
gregated and per-flow service requests, is much better than generally assumed. 

Besides its complexity of operation, RSVP is often objected to as being overly complex 
for implementation. Our own experience shows that RSVP indeed exhibits a certain 
complexity. However, we have been able to realize an almost complete and even multi- 
threaded implementation of RSVP investing less than 18 person-months of develop- 
ment effort. Given the large applicability and the inherent complexity of the underlying 
problem of providing performant end-to-end services, we believe that this experience 
somewhat contradicts those objections. 

6 Related Work 

Because of the fairly broad scope of this paper, almost all research in the area of QoS 
for packet-switched networks can be considered as related work. Here, we have to re- 
strict ourselves to only a few relevant examples. 

Very interesting work has been carried out in the area of open signalling [21].
However, the focus of this work goes much beyond our understanding of signalling in
both effort and goals. It is targeted towards creating programmable interfaces
employing active networking nodes. In that sense it can be considered more
heavy-weight and less evolutionary than a simple protocol-based approach.

Many other proposals have been made for so-called “lightweight” signalling protocols,
e.g. in [12,22,23]. While all these proposals have interesting properties, we believe
it is advantageous to approach the overall problem with a single homogeneous protocol
rather than using multiple protocols for different services and scopes, because a
single protocol eliminates functional redundancy.

In comparison to proposals on how to carry out the inter-operation of multiple QoS
mechanisms, we concentrate on the interface role of a signalling protocol and take
more freedom to extend the current RSVP specification. Work as described in [10,13,14]
can be considered complementary, in that low-level detailed aspects of inter-operation
are examined and solved.

7 Conclusions and Future Work 

In this paper we have discussed and illustrated the feasibility of an extended version
of RSVP to serve as a general signalling interface for multi-service networks. We have
presented a flexible, role-based QoS signalling architecture, based on an extended
version of RSVP. This architecture utilizes the observation that a signalling protocol
can be considered as carrying out two roles, as a distributed algorithm and as an
interface mechanism. We have then presented a use case analysis to demonstrate that
such a system architecture can enable general service signalling for a large variety
of service classes, including aggregate and per-flow services. Experiences and
performance numbers from creating the basic building block, a new implementation of
RSVP, have been included in this paper to argue against common prejudices in this
area. Finally, we have briefly discussed the relation of this work to other approaches.

We intend to realize the full system described in this paper, partially in the
framework of a cooperative European research project. If time permits, further
examination and tuning of the core RSVP engine will be carried out in the future. A
particular focus of our research agenda will be the generic yet efficient realization
of inter-operation between RSVP and actual QoS technologies, such as DiffServ. Of
course, the discussion about the best way to provide quantitative and reliable QoS
assurances in the Internet, to eventually create a truly multi-service network, is
still open and further work is needed on all aspects.



Abstract. As the Internet undergoes substantial changes, demands on
the quality of communication are growing. Differentiated Services
(Diffserv) was standardized to meet such requirements. Currently, it is
assumed that Diffserv is used under a static Service Level Agreement
(SLA). Under a static SLA, the agreement between the service provider
and the service subscribers is made as a result of negotiation between
human agents. However, the service subscribers, especially individual
users, may want a dynamic SLA with which they can make their contract
without human intervention and use the network resources immediately.
Although many publications have addressed this point, dynamic SLA has
not yet been operated in a live network. Moreover, only a few
experiments with dynamic SLA have been made.

In this paper, we describe our field trial of a dynamic SLA and resource
reservation system at the spring retreat of the WIDE Project, including
its network configuration and the results of the trial. Through this
experiment, we attempt to reveal how users behave with dynamic SLA and
what mechanisms such a system needs.



1 Introduction 

The improvements of Internet technology in the last decade have shifted
its technical focus from reachability to the quality of communication.
Internet Service Providers (ISPs) diversify their services and generate externality
so that they can provide value-added services. Today, they are discussing
what kind of services should be offered and what kind of service management
architecture the services require.

Traditional quality of service (QoS) mechanisms provide a hard bound on
resource allocation, known as deterministic guaranteed service [1]. To bring
such telephone-like services to the Internet, Integrated Services (Intserv) [2] was
standardized. However, there are some technical issues in the Intserv framework.
First, the number of flow states the routers have to maintain is a major issue. In
the Intserv architecture, all the routers on the way from the traffic source to its
sink have to keep per-flow state. Therefore, the Intserv framework cannot be
applied to large-scale networks such as the core backbone networks of
commercial ISPs. Furthermore, Intserv does not meet an essential requirement:
with a QoS-guaranteed service, service subscribers should not have to specify
how much bandwidth is to be guaranteed in the network. In many cases, it is
very hard for service subscribers to tell exactly how much bandwidth a service
needs. Nevertheless, they still want to ask ISPs to guarantee the quality of
service. Therefore, QoS management requires some kind of adaptive mechanism
to fulfill what the subscribers want. Expected Capacity [3] was proposed to
achieve this adaptive mechanism in the Intserv framework.

Differentiated Services (Diffserv) [4] has been standardized to satisfy the
requirements discussed above. The fundamental idea of Differentiated Services
is to aggregate flows which require the same quality of service and thereby reduce
the state information that the routers in the core network must maintain.

Currently, it is assumed that Diffserv is used with a static service level
agreement (SLA). Under a static SLA, the agreement between the service provider
and the service subscribers (down-stream ISPs or individual users) is made as a
result of negotiation between human agents.

However, it is expected that the subscribers in the smallest unit of a
Diffserv-capable network are individual users, and they may want a dynamic SLA
with which they can make their contract without human intervention and use the
network resources immediately. Nevertheless, few experiments have been made
that apply dynamic SLA to a Diffserv-capable network in operation. Therefore,
it has not become clear how users behave with dynamic SLA, and what
mechanisms such a system needs.

In this experiment, we focus on dynamic SLA for the Diffserv-capable
network. We conducted this experiment over 3 days at the WIDE Project retreat.
In this paper, we first present Diffserv and the inter-DS-domain model, and then
describe where this experiment fits in our model. We then give details of our
implementation of and experiment with an admission control system. Finally, we
show the results of this experiment and its evaluation.



2 Differentiated Services 

Diffserv is a framework to combine multiple quality assurance mechanisms and
provide statistical differentiation of services. A service level agreement (SLA) is
made between the subscriber and the service provider, and the service provider
offers various services according to this agreement. A service defines some
characteristics of flow handling, such as flow classification or packet
transmission. These characteristics are specified as quantitative or statistical
indicators of throughput, delay, jitter, loss, or some other priority of access to
network resources.

At the boundary of a Diffserv-capable network, there are ingress/egress
edge nodes. Such a Diffserv-capable network is called a DS domain. A flow
entering a DS domain is classified, marked in its IP header and possibly
conditioned at the ingress edge node, according to its SLA. This mark in the IP
header is called the DS codepoint or DSCP [5]. Flows with the same DSCP are
aggregated into a single behavior aggregate, and within the core of the DS
domain, packets are forwarded according to the per-hop behavior associated with
the DSCP.
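
The division of work between edge and core nodes can be illustrated with a small
sketch. The ingress node classifies a packet by its flow and writes a DSCP; core
nodes look only at the DSCP to select a per-hop behavior. The flow key, the SLA
table, the queue names and the function names are assumptions made for this
illustration. The codepoint values are the standard ones for default forwarding,
AF11 and EF.

```python
# DSCP values as standardized by the IETF (6-bit codepoints).
DSCP_DEFAULT = 0    # default (best-effort) forwarding
DSCP_AF11 = 10      # Assured Forwarding class 1, low drop precedence
DSCP_EF = 46        # Expedited Forwarding

# Hypothetical SLA table at the ingress node: flow key -> DSCP.
SLA_TABLE = {
    ("10.0.0.5", "192.0.2.1", 5004): DSCP_EF,
    ("10.0.0.7", "192.0.2.9", 80): DSCP_AF11,
}

# Per-hop behavior selected in the core, keyed only by DSCP (names are illustrative).
PHB_TABLE = {
    DSCP_DEFAULT: "best-effort queue",
    DSCP_AF11: "AF class 1 queue",
    DSCP_EF: "priority queue",
}

def mark_at_ingress(packet):
    """Classify by (src, dst, dst_port) and write the DSCP into the packet."""
    key = (packet["src"], packet["dst"], packet["dst_port"])
    packet["dscp"] = SLA_TABLE.get(key, DSCP_DEFAULT)
    return packet

def forward_in_core(packet):
    """Core nodes keep no per-flow state; they only read the DSCP."""
    return PHB_TABLE[packet["dscp"]]

pkt = mark_at_ingress({"src": "10.0.0.5", "dst": "192.0.2.1", "dst_port": 5004})
other = mark_at_ingress({"src": "10.0.0.99", "dst": "192.0.2.1", "dst_port": 22})
print(pkt["dscp"], forward_in_core(pkt))
print(other["dscp"], forward_in_core(other))
```

The core lookup table has one entry per codepoint regardless of how many flows are
active, which is the state reduction the aggregation is meant to achieve.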

However, the Diffserv framework mentioned above covers only the intra-domain
architecture, and the inter-domain architecture lacks an operational
perspective. Namely, under the current Diffserv framework, issues surrounding
the inter-domain architecture and dynamic SLA are not addressed at all.

Another limitation of the Diffserv framework is that it does not take the
practical Internet operation model into account. For example, today's Internet
is organized as a hierarchical chain of DS domains, rooted at major Internet
exchange points. Although one DS domain must eventually interact with
subscribers, this situation has not been addressed in the framework document.

Moreover, the current SLA framework in Diffserv is a static one, because this
makes deployment easy: decisions about operational issues, such as bandwidth
allocation plans or accounting, can be made by an administrator, and such
issues can be separated from mechanisms such as router settings.

However, the authors believe that such static SLA is transitional.
Requirements for QoS/CoS can be roughly divided into persistent and transient
ones, especially when both the ISP's and the subscribers' viewpoints are
considered. An example of the former is a virtual leased line; examples of the
latter include limited-time broadcasts of live events or administrative traffic
for emergency server maintenance. Thus, it is easy to imagine that QoS is
shifting to a mixed environment with static SLA and dynamic SLA. In such an
environment, the present policy framework is not sufficient, as it only
supports static SLA.

We believe that the development of dynamic SLA is vital to promote
widespread application of Diffserv.



3 Configuration of Diffserv Field-Trial

We designed, implemented and administered an operational, real Diffserv
network at the year 2000 spring retreat of WIDE Project [6], which we call
“WIDE camp”. WIDE Project is a non-profit consortium for research and
development of the Internet and related technologies. A 4-day camp for members
of WIDE Project is held twice every year, once in spring and once in autumn. A
time-limited operational network called “WIDE camp-net” is built for every
WIDE camp. The network is both highly operational and highly experimental;
new networking technologies such as IPv6 and MPLS have been brought into an
operational network that most attendees rely on.

At the WIDE camp-net of spring 2000, we carried out an experiment with a
Diffserv network that accepts immediate reservation requests from users. The
spring 2000 WIDE camp was held at Isawa, Yamanashi, Japan, and there were 236
participants.

3.1 Field-Trial Overview

In order to make deployment both visible and easy, we incorporated a
number of ideas into our test-bed system.

Since our test-bed was part of a dual-stack network supporting both IPv4 and
IPv6, there were a number of routers at which traffic control would have to be
applied. To reduce the operational overhead of traffic control, we separated the
traffic control functions from these routers by introducing ATM PVC bridging:
one IPv4 router also acts as an ATM PVC bridge for IPv6. In other words, all
traffic was concentrated on one router.

The QoS control of the commodity traffic of the WIDE camp-net was done
at both ends of the external link. We used the COPS (Common Open Policy
Service) [7] protocol for provisioning the QoS control parameters to the
routers. The COPS PDP (Policy Decision Point), which works as the provisioning
server, was located within the WIDE camp-net. The COPS PEPs (Policy
Enforcement Points), which receive the provisioned information from the PDP
and apply the QoS configuration, were located within the routers at both sides
of the external link.

To avoid indefinite occupation of network bandwidth, some means of
promoting fair use of reservations must be implemented. We implemented an
accounting system, together with a “virtual currency” that is withdrawn from
each user's account according to the usage history.

We also provided a reservation tool to users, so that users can directly issue
reservation requests to the Diffserv-capable network. The tool was provided for
both IPv4 and IPv6.

Bandwidth reservations were made visible through a web-based front-end,
which also showed the virtual currency remaining in the user's account.

3.2 Network Topology 

In this section, we illustrate the network topology of the experiment held at
the WIDE camp-net. The network topology of the WIDE camp-net is shown in
Fig. 1. The camp-net is connected to the Internet using a T1 (1.5Mbps) external
link. ATM was used as the layer 2 protocol of the T1 external link.

Three virtual circuits (VCs) were configured over the T1 external link. These
VCs were used for the following purposes:

1. connection to the IPv4 Internet

2. connection to the IPv6 Internet via Keio University at Fujisawa

3. connection to the IPv6 Internet via NAIST, Nara

Each edge node shown in Fig. 1 was configured as a PEP. The PDP was located
within the WIDE camp-net. In this experiment, the PEPs together with the T1
external link were considered as a DS (Diffserv) domain.

Fig. 1. Network Topology of WIDE camp-net

The layer 2 configuration at the external gateway of WIDE camp-net is shown
in Fig. 2. The external T1 line is connected to the ATM switch. The ATM switch
is connected to a PC router. Connectivity to the IPv4 Internet is provided within
the WIDE camp-net via the PC router. The ATM switch is also connected to two
IPv6 routers. The VCs for both IPv6 networks are connected to the IPv6 Internet
via the PC router. For traffic queueing, ALTQ [8] is used within the PC router.
For the traffic from WIDE camp-net to the Internet, Diffserv AF style marking
was done at the input queue of the PC router. In the output queue for the
traffic from WIDE camp-net to the Internet, HFSC [9] with RIO [10] was used
for scheduling and queueing.



[Figure 2 labels: Ethernet (IPv6), Ethernet (IPv6), Ethernet (IPv4), ATMSW, Router, Fujisawa]

Fig. 2. Layer 2 Configuration of Gateway of Isawa



Each PC router in Isawa and Fujisawa was configured as a PEP. The PDP
was located within the WIDE camp-net. The TCP connections between the PDP
and the PEPs were kept alive during the experiment.

3.3 Administrative Model 

In this section, we show the administrative model of the experiment. In this
experiment, the reservation services were provided to users within the WIDE
camp-net.

For reservation, we divided the bandwidth of the external link into 19 blocks
(Fig. 3). 18 blocks were used for reservation by users, and 1 block was used for
the default traffic. Each of the 18 reservable blocks provides 64kbps. In this
experiment, each user request results in the reservation of one 64kbps block.
Since the number of blocks for reservation is 18, the number of simultaneous
reservations is limited to 18. However, during non-congested times, the
available bandwidth is shared by the existing flows.
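
The admission rule implied by this block model is simple enough to state as code.
The following is a minimal sketch and not the PDP implementation used in the
experiment; the class and method names are hypothetical, and rejecting a second
request for an already reserved flow is an assumption of this sketch.

```python
BLOCK_KBPS = 64          # size of one reservable block
RESERVABLE_BLOCKS = 18   # blocks available for user reservations

class BlockAdmission:
    """Admit one 64kbps block per request until all 18 blocks are taken."""

    def __init__(self, blocks=RESERVABLE_BLOCKS):
        self.free = blocks
        self.holders = set()

    def reserve(self, flow_id):
        # Assumption: a flow that already holds a block is not admitted twice.
        if flow_id in self.holders or self.free == 0:
            return False
        self.holders.add(flow_id)
        self.free -= 1
        return True

    def release(self, flow_id):
        if flow_id in self.holders:
            self.holders.remove(flow_id)
            self.free += 1

ctrl = BlockAdmission()
results = [ctrl.reserve(f"flow-{i}") for i in range(19)]
print(results.count(True), results[-1])   # 18 False
```

The nineteenth request fails because no block is free; it can succeed only after
an earlier reservation has been released.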

When a user reserves a 64kbps block, the reserved flow is marked blue, in
the manner of Diffserv AF. When the traffic of the reserved flow exceeds
64kbps, the reserved flow is marked yellow. When the traffic of the reserved
flow greatly exceeds 64kbps, the reserved flow is marked red. Red-marked
packets are treated in the same way as the default traffic.
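
The three-level marking can be sketched as a function of the measured rate of a
reserved flow. This is a simplification for illustration: the text does not state
at which rate the marking changes from yellow to red, so RED_THRESHOLD_KBPS below
is a purely hypothetical value, and a real marker would typically meter packets
rather than compare an average rate.

```python
RESERVED_KBPS = 64
# Hypothetical: the rate at which traffic counts as "greatly exceeding" is not given.
RED_THRESHOLD_KBPS = 128

def colour(rate_kbps, reserved=RESERVED_KBPS, red_threshold=RED_THRESHOLD_KBPS):
    """Return the marking for a reserved flow sending at rate_kbps."""
    if rate_kbps <= reserved:
        return "blue"      # within the reservation
    if rate_kbps <= red_threshold:
        return "yellow"    # exceeds the reservation
    return "red"           # treated like default traffic

for rate in (48, 64, 96, 200):
    print(rate, colour(rate))
```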



[Figure 3 labels: 1.5Mbps external link, 64kbps x 18 reservation blocks, default block]

Fig. 3. Bandwidth Division of 1.5Mbps External Link



The reservation procedure is shown in Fig. 4. A reservation from a user is
made in the following steps:

1. The user sends a request to the nearby PEP using TCP.

2. The PEP that received the request sends a COPS REQ (Request) message to
   its PDP. The HANDLE for the COPS REQ message is created by the PEP.

3. If the PDP decides that the request from the PEP is acceptable, the PDP
   sends a COPS DEC (Decision) message to the connected PEPs. The HANDLE for
   this COPS DEC message is created by the PDP.

4. The PEPs apply the configuration given in the COPS DEC message from the
   PDP, and report the result to the PDP using a COPS RPT message.

5. After receiving RPT messages from all the PEPs, the PDP sends a COPS DEC
   message to the requesting PEP using the HANDLE generated with the COPS REQ
   message in step 2.

6. The user is informed of the result of the reservation by the PEP.
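
The message sequence of these six steps can be traced with a short simulation. The
sketch below only records who sends what to whom; it does not implement the COPS
wire format, and the handle strings, node names and the behaviour on rejection
(a negative decision sent straight back to the requesting PEP) are assumptions of
this sketch.

```python
def run_reservation(request, pdp_accepts, peps, requesting_pep):
    """Return the ordered (sender, receiver, message) log of one reservation."""
    log = [("user", requesting_pep, "request (TCP)")]                  # step 1
    req_handle = f"{requesting_pep}-h1"      # handle created by the PEP
    log.append((requesting_pep, "PDP", f"REQ handle={req_handle}"))    # step 2
    accepted = pdp_accepts(request)
    if accepted:
        cfg_handle = "PDP-h1"                # handle created by the PDP
        for pep in peps:                                               # step 3
            log.append(("PDP", pep, f"DEC handle={cfg_handle} install"))
        for pep in peps:                                               # step 4
            log.append((pep, "PDP", f"RPT handle={cfg_handle} success"))
    verdict = "accept" if accepted else "reject"
    log.append(("PDP", requesting_pep,
                f"DEC handle={req_handle} {verdict}"))                 # step 5
    log.append((requesting_pep, "user", f"result: {verdict}"))         # step 6
    return log

peps = ["PEP-Isawa", "PEP-Fujisawa"]
ok_log = run_reservation({"kbps": 64}, lambda r: r["kbps"] == 64, peps, "PEP-Isawa")
bad_log = run_reservation({"kbps": 512}, lambda r: r["kbps"] == 64, peps, "PEP-Isawa")
for entry in ok_log:
    print(entry)
```

Note that two different handles appear in the trace: the one created by the
requesting PEP identifies the user's request, while the one created by the PDP
identifies the configuration pushed to all PEPs.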

3.4 User Tool for Reservation 

At the year 2000 spring WIDE camp, the attendees were considered as users of
the reservation system. We distributed 2000 WU (Wide Unit, a virtual currency)
to every attendee for reservation purposes. A reservation of 64kbps for 1 minute
could be obtained by paying 10 WU.
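
With these two figures the pricing can be worked out directly: 2000 WU at 10 WU per
minute buys at most 200 minutes of one 64kbps block. The following sketch shows the
arithmetic; the Account class and the policy of charging a whole reservation up
front are assumptions made for illustration, not a description of the accounting
system used at the camp.

```python
INITIAL_WU = 2000            # balance handed to every attendee
WU_PER_BLOCK_MINUTE = 10     # price of one 64kbps block for one minute

def cost_wu(minutes, blocks=1):
    """Price of reserving `blocks` 64kbps blocks for `minutes` minutes."""
    return WU_PER_BLOCK_MINUTE * minutes * blocks

class Account:
    """Hypothetical per-user account charged when a reservation is made."""

    def __init__(self, balance=INITIAL_WU):
        self.balance = balance

    def charge(self, minutes, blocks=1):
        price = cost_wu(minutes, blocks)
        if price > self.balance:
            return False     # not enough virtual currency left
        self.balance -= price
        return True

acct = Account()
print(acct.charge(30), acct.balance)        # True 1700
print(INITIAL_WU // WU_PER_BLOCK_MINUTE)    # 200
```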

Fig. 4. Overview of Reservation from users

We created a tool with which users request reservations. The tool was called
"PEPe". PEPe establishes a TCP connection to the PEP when sending a
reservation request. The request message sent by PEPe is shown in Fig. 5. The
user id and password were used to identify users. Flow information is included
within the PEPe message.




Fig. 5. PEPe message 



After every transaction for the request has been processed within the PEP and
the PDP, PEPe receives a report from the PEP. PEPe prints out the report and
terminates the connection with the PEP.

4 Evaluation

In this section, we discuss the evaluation of the experiment.

For evaluation, we created artificial congestion on the external link. The
network topology for evaluation is shown in Fig. 6.



Fig. 6. Network topology for evaluation



First, we used netperf [11] with the options shown in Fig. 7. Netperf was used
in the following patterns: 1) from outside to inside of WIDE camp-net, 2) from
inside to outside of WIDE camp-net, and 3) from both sides of WIDE camp-net.
Next, we emulated congestion by sending UDP traffic from outside to inside of
WIDE camp-net. The UDP traffic was CBR (Constant Bit Rate) traffic.

netperf -H isawa-gw -l 100000 -- -s 64K -S 64K -m 1460

Fig. 7. Options for netperf

The start time, end time and type of each period of artificial congestion are
shown in Table 1.

The normal traffic on the external link is shown in Fig. 8. Fig. 8 shows traffic
from outside to inside of the WIDE camp-net. Fig. 9 shows traffic from outside
to inside of the WIDE camp-net with netperf. Since the queue length in the
gateway was large, there was no packet loss on the external link. Fig. 10
shows traffic from outside to inside of the WIDE camp-net with UDP traffic.
The link bandwidth was consumed most heavily when artificial congestion was
created by sending UDP traffic. Packet loss occurred while the UDP traffic was
sent.

The number of reservation requests from the users is shown in Fig. 11.
Reservation requests were sent most often during heavy congestion on the
external link.

Table 1. Artificial Congestion

Start time       End time         Type
3/15 18:14:50    3/15 18:19:05    2
3/15 23:02:57    3/16 00:02:16    3
3/16 12:33:37    3/16 13:39:28    1
3/16 13:42:54    3/16 13:49:03    1
3/16 13:54:22    3/16 15:06:01    3
3/16 16:19:47    3/16 19:39:47    UDP
3/16 20:55:27    3/16 23:49:24    UDP

The number of reservation request errors (that is, failures to satisfy
reservation requests) is shown in Fig. 12. Reservation request errors were
reported most often during heavy congestion on the external link. Most of the
errors during heavy congestion were caused by overflow of reservations. In this
experiment, we arranged 18 entries of 64kbps bandwidth for reservation. When
the 18 entries were full, a reservation request from a user resulted in an error.

This shows that the attendees felt the network bandwidth was worth reserving
and wanted additional value beyond connectivity during heavy congestion. As a
result, requests for bandwidth were mostly made during these periods. In
contrast, during periods without congestion, users did not request
reservations.

This also shows that requests were made more frequently as the WIDE camp
progressed. The experiment was the first exposure to a dynamic SLA system for
most attendees, so at first they hesitated to reserve bandwidth. However, once
they had learned the effectiveness of the bandwidth reservation system, they
used it frequently. This particular observation indicates the effectiveness of
a bandwidth reservation system under certain circumstances.

5 Conclusion 

While Diffserv has been standardized to meet growing demands from diverse
Internet applications, the currently supported service model, i.e., static SLA,
is rather limited. The authors believe that dynamic SLA will be used in the
near future.

In order to achieve widespread use of dynamic SLA, it is necessary to
understand user behavior in a Diffserv Internet enhanced with dynamic SLA. We
have designed, implemented and operated a live Diffserv-capable network with a
specific focus on dynamic SLA. During the 4 days of the WIDE retreat, most
attendees were able to use the immediate bandwidth reservation system that we
developed.

Since this was the first exposure to a bandwidth reservation system for most
attendees, they were reluctant to use the system until they faced severe
congestion, when they really needed it. We forced every attendee to learn and
use this new tool by creating artificial congestion. After they had learned the
effectiveness of the bandwidth reservation system, they used it eagerly. This
particular observation indicates the effectiveness of a bandwidth reservation
system under certain circumstances.

Thus, static SLA is useful only until dynamic SLA is widespread among several
ISPs and subscribers are aware of its effectiveness. Our observation through
the field-trial at the WIDE retreat confirms this argument. We believe that
successive work on this topic will further support and amplify this discussion.

Fig. 11. The number of reservation requests




Fig. 12. The number of reservation request errors



Abstract. Objectives for traffic engineering in IP core networks can be
different for different Quality-of-Service classes. In this paper we study
the traffic engineering objectives for best-effort flows and for virtual
leased line service flows. We propose mixed integer linear programming
models that can be used for off-line centralized traffic engineering
with label switched paths, and develop efficient algorithms for their
approximate solution. We quantify the effect of the choice of traffic
engineering objective on the network performance, and assess the
benefits of distributing the traffic between a single border node pair
over multiple paths.

Keywords: traffic engineering, explicit routing, Quality-of-Service, 
diffserv. 



1 Introduction 

1.1 Background

Traffic engineering (TE) is concerned with the optimisation of the performance of
networks, or equivalently, the optimisation of resource usage to this end [1]. It is well
known that traditional Interior Gateway Protocols (IGPs) such as Open Shortest Path
First (OSPF) are inadequate for traffic engineering purposes because:

- a routing strategy solely based on destination addresses makes it impossible to use
  different routes between the same source and destination for different
  Quality-of-Service (QoS) classes;

- a routing strategy with the built-in restriction that only paths can be used that are
  shortest with respect to some metric lacks flexibility.

The use of Multi-Protocol Label Switching (MPLS) [2] offers a way to obviate these
deficiencies [3]. By explicitly setting up Label-Switched Paths (LSPs) one achieves a
much finer control over the way in which the flows through the network use the
network resources.

J. Crowcroft, J. Roberts, and M. Smirnov (Eds.): QofIS 2000, LNCS 1922, pp. 129-140, 2000.



1.2 Contributions 

We propose algorithms that are applicable for off-line centralized traffic engineering 
over label-switched paths. These algorithms need a description of the network re- 
sources and a description of the load on the network, and determine the explicit routes 
that the flows will take. Our algorithms complement previous traffic engineering algo- 
rithms, which were based on the assumption that OSPF would be used for traffic engi- 
neering, and optimised the set of administrative weights OSPF uses [4,5]. 

We describe the load as a set of traffic trunks [3], and the traffic engineering algo- 
rithms calculate explicit routes for these traffic trunks. A traffic trunk (TT) is an ag- 
gregation of flows belonging to the same QoS class that are forwarded through a 
common path. A traffic trunk is characterised by its ingress and its egress node and by 
a set of attributes which determine its behavioral characteristics and requirements 
from the network. 

We consider two classes of traffic trunks. The first class contains traffic requiring a 
virtual leased-line service of the network. This is traffic that would be mapped onto 
the Expedited Forwarding (EF) Per-Hop Behavior (PHB) in a diffserv network. The 
second class contains best-effort traffic, that is, traffic mapped onto the Best-Effort 
(BE) PHB. In this paper we will show that the characteristics of these two classes can
be captured in a traffic engineering model by describing the bandwidth requirement of
BE and EF trunks in a different way, and by using different traffic engineering
objectives. Our approach will allow the network operator to differentiate the way in
which the traffic engineering algorithm handles traffic trunks with different QoS
requirements.



1.3 Overview of the Paper 

In Section 2, we propose four basic mixed integer linear programming (MILP) models 
for off-line centralized traffic engineering over label-switched paths. 

In Section 3, we show that the MILP models can only be solved to optimality for
small problem instances (small networks, or a small number of traffic trunks). We
analyze the factors that complicate their solution, and propose heuristic strategies to
obtain good solutions in an acceptable amount of time.

In Section 4, we present numerical results that illustrate the flexibility of our MILP 
models. We show the effect of the choice of the bandwidth requirement model and the 
traffic engineering objective. We quantify the impact of traffic engineering by com- 
paring the solutions we obtain with solutions in which all traffic trunks are routed via 
shortest paths, that is, the solution OSPF would come up with. Finally, we quantify the 
benefit of allowing the use of more than one path for each traffic trunk. 

In Section 5, we list the conclusions that we draw from our research and suggest
possibilities for future work.

2 Model

2.1 Network Model 

We represent the topology of the network as a directed graph $G=(V,E)$, where $V$ is the
node set of this graph and $E$ is its link set. The endpoints of link $e$ are denoted $u_e$
and $v_e$. We denote the capacity of link $e$ as $c_e$.

The utilisation $\rho_e$ of link $e$ is defined as the fraction of the capacity of link $e$
that is actually used, that is, the amount of capacity used on link $e$ is $\rho_e c_e$. The
utilisation of the links is known as soon as an explicit path has been chosen for each
traffic trunk.

The following traffic engineering objectives can be termed resource-oriented because
they are defined in terms of the utilisation of network resources:

- Minimization of capacity usage. We define the capacity usage $U$ as the total
amount of capacity that is used, that is, $U = \sum_{e \in E} \rho_e c_e$.

- Load balancing. The purpose of load balancing is to prevent some of the links
or nodes from becoming bottlenecks, by spreading the traffic evenly over the links and
nodes of the network. Although this is a resource-oriented TE objective, it is coupled
with several traffic-oriented TE objectives, such as minimisation of packet
loss and delay. If we define the balance $B$ as one minus the maximal link utilisation,
that is, $B = 1 - \max_{e \in E} \rho_e$, load balancing is achieved by maximising $B$.
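
The two resource-oriented quantities can be computed directly once a path has been fixed for each trunk. The following is a minimal sketch, not the authors' implementation; the function names, the three-node network and the demands are illustrative assumptions.

```python
def link_loads(paths, demands):
    """Sum the demand carried on each directed link (u, v)."""
    loads = {}
    for trunk, path in paths.items():
        for link in zip(path, path[1:]):
            loads[link] = loads.get(link, 0.0) + demands[trunk]
    return loads


def capacity_usage_and_balance(paths, demands, capacity):
    """Return (capacity usage, balance) for a fixed routing."""
    loads = link_loads(paths, demands)
    # Capacity usage: total amount of capacity used over all links.
    usage = sum(loads.values())
    # Balance: one minus the largest link utilisation.
    max_util = max(loads.get(link, 0.0) / cap for link, cap in capacity.items())
    return usage, 1.0 - max_util


# Hypothetical example: three links of capacity 10, two trunks.
capacity = {("a", "b"): 10.0, ("b", "c"): 10.0, ("a", "c"): 10.0}
paths = {"k1": ["a", "b", "c"], "k2": ["a", "c"]}
demands = {"k1": 2.0, "k2": 5.0}
usage, balance = capacity_usage_and_balance(paths, demands, capacity)
```

In this example the two-hop trunk consumes capacity on two links, so the usage counts its demand twice, while the balance is determined by the most heavily utilised link only.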

2.2 Traffic and Routing Model 

We denote the set of traffic trunks as $K$. The endpoints of trunk $k$ are denoted $s_k$ and
$t_k$ (the source and sink of $k$). We denote the set of all possible $[s_k,t_k]$-paths in
$G=(V,E)$ by $P_k$. The route followed by a path $p$ in the network is described by
$\varphi_{pe}$, which takes the value 1 if $p$ crosses link $e$, and 0 if it does not. We
associate a routing variable $u_{kp}$ with each path. The value taken by this variable is 1
when path $p$ is used by trunk $k$, and 0 otherwise.

We characterise the bandwidth requirement of an EF trunk by its nominal bit rate
requirement $d_k$. An EF trunk expects to be able to inject traffic into the network at a
constant rate $d_k$, that is, it expects a hard guarantee from the network. A BE trunk $k$
has a bandwidth requirement described by an excess bit rate $h_k$, but does not really
expect a guarantee from the network.

The share of a traffic trunk is the amount of capacity that is allocated to it. The
share that traffic trunk $k$ receives is denoted by $\sigma_k$. For traffic trunks that do
not impose a stringent bandwidth requirement, an optimization of the network performance
can be achieved by considering the following two traffic engineering objectives. We
call them traffic-oriented, because they are defined in terms of shares:

- Throughput maximization. We define throughput as the total bandwidth that is
guaranteed by the network to its customers, that is, $T = \sum_{k \in K} \sigma_k$.

- Maximizing fairness. We define the fairness of a TE solution as the minimum
weighted share, that is, $F = \min_{k \in K} \sigma_k / h_k$.

A traffic trunk $k$ that asks for a hard guarantee for a bandwidth $d_k$ expects its share to
be $d_k$. Therefore, the traffic-oriented objectives do not make sense for such a trunk.
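
As an illustration of the two traffic-oriented quantities, the sketch below computes throughput and fairness from a set of shares. It assumes that the weight of a share is the reciprocal of the trunk's excess bit rate, which is one reading of "minimum weighted share"; the names and numbers are hypothetical.

```python
def throughput(shares):
    """Total bandwidth allocated to all trunks."""
    return sum(shares.values())


def fairness(shares, excess_rate):
    """Smallest share relative to the requested excess bit rate (assumed weighting)."""
    return min(shares[k] / excess_rate[k] for k in shares)


# Hypothetical BE trunks: k1 asks for 6 units and gets 3, k2 asks for 4 and gets 4.
shares = {"k1": 3.0, "k2": 4.0}
excess_rate = {"k1": 6.0, "k2": 4.0}
total = throughput(shares)
worst = fairness(shares, excess_rate)
```

Here the fairness is set by the trunk that receives the smallest fraction of what it asked for, regardless of how much the other trunks receive.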



2.3 Traffic Engineering Models 

We build a taxonomy of traffic engineering models, and use the following criteria for 
their classification: 

- choice of objectives: resource-oriented versus traffic-oriented traffic engineering;

- number of paths per traffic trunk: single path versus multiple path traffic
engineering.

We model the single path traffic engineering problems, and demonstrate that models 
for multiple path traffic engineering problems are generalisations (linear programming 
relaxations) of these single path models. 

The first model we build is a model of a single path traffic engineering problem
which targets load balancing and minimization of resource usage. We call this problem
SPTE-RO (single path / resource-oriented).



Problem SPTE-RO

The objective function (i) is a linear combination of capacity usage and balance. A
unit of imbalance is penalized $M$ times more than a unit of capacity usage, where $M$ is a
large number. This means that an algorithm solving this model will favour a high balance
over a low capacity usage. Constraints (ii) and (vi) together force the use of a single
path for each traffic trunk $k$, by allowing only one of the routing variables $u_{kp}$ to be
nonzero. The equalities (iii) are bundling constraints [6], which express the capacity
used on a link as the sum of the capacities allocated to the paths crossing that link.
Constraints (iv)-(v) define the objectives: balance and capacity usage.
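
The structure of the single path, resource-oriented problem can be illustrated on a toy instance by exhaustive search. The sketch below is not the MILP itself and not the authors' code: it enumerates one candidate path per trunk, discards routings that exceed a link capacity, and minimises capacity usage minus a large weight times the balance. The exact form of the objective, the weight `big_m` and the example network are assumptions.

```python
from itertools import product


def spte_ro_brute_force(candidates, demands, capacity, big_m=1000.0):
    """Pick one candidate path per trunk, minimising usage - big_m * balance."""
    trunks = sorted(candidates)
    best = None
    for choice in product(*(candidates[k] for k in trunks)):
        loads = {}
        for k, path in zip(trunks, choice):
            for link in zip(path, path[1:]):
                loads[link] = loads.get(link, 0.0) + demands[k]
        if any(load > capacity[link] for link, load in loads.items()):
            continue  # this routing violates a link capacity
        usage = sum(loads.values())
        balance = 1.0 - max(loads[link] / capacity[link] for link in loads)
        cost = usage - big_m * balance
        if best is None or cost < best[0]:
            best = (cost, dict(zip(trunks, choice)), usage, balance)
    return best


# Hypothetical instance: two trunks from a to b, a direct link and a detour via c.
capacity = {("a", "b"): 10.0, ("a", "c"): 10.0, ("c", "b"): 10.0}
candidates = {
    "k1": [["a", "b"], ["a", "c", "b"]],
    "k2": [["a", "b"], ["a", "c", "b"]],
}
demands = {"k1": 4.0, "k2": 4.0}
best = spte_ro_brute_force(candidates, demands, capacity)
```

With a large weight on the balance, the search prefers to send one trunk over the detour: this uses more capacity than routing both trunks on the direct link, but it halves the maximal link utilisation. Exhaustive enumeration is only practical for very small instances.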

If we drop the inequalities (vi) from model SPTE-RO, we open the possibility of
choosing more than one path for forwarding the traffic on a traffic trunk. We call the
new model MPTE-RO (multiple path / resource-oriented). Using multiple paths per
traffic trunk is feasible if the label switched routers are capable of distributing the
traffic of one trunk over multiple paths. It enhances the traffic handling capabilities of
the network.

SPTE-RO and MPTE-RO will be the models of choice for engineering trunks containing
EF traffic. By choosing load balancing as the first objective, we prevent bottlenecks
from arising, and keep the delay and packet loss incurred by EF traffic flows low.
By minimizing capacity usage as the second objective, we ensure that no resources are
wasted in achieving this.

The second class of models we build targets maximum fairness and throughput.
SPTE-TO (single path / traffic-oriented TE) is the name we give to the single path
variant, and MPTE-TO (multiple path / traffic-oriented TE) is the name of the multiple
path variant.

Problem SPTE-TO

In order to model the problem as a mixed integer linear programming problem, we
need to introduce new variables. For each traffic trunk $k$, we associate a path flow
variable $f_{kp}$ with each $[s_k,t_k]$-path $p$. The value taken by this path flow variable
is equal to the amount of capacity reserved for trunk $k$ on the links crossed by path $p$.

The inequalities (ii) simply state that a path is used ($u_{kp}=1$) as soon as capacity is
allocated to it ($f_{kp}>0$). The equalities (iii) are flow conservation constraints [6],
which equate the share received by a traffic trunk to the sum of the capacities allocated
to that trunk on each of the paths available to that trunk. The only other differences
from model SPTE-RO are the objective function (i), and (vi)-(vii), which define the
two objectives: fairness and throughput. Again, we obtain the multiple path variant
MPTE-TO by dropping the constraints (viii).

SPTE-TO and MPTE-TO will be the models of choice for engineering trunks containing
BE traffic. By choosing fairness as the first objective, there will be no arbitrary
starvation of BE traffic. By choosing throughput as the second objective, one can
expect that the network will be used to the fullest.



Table 1. Number of constraints, total number of variables and number of integer variables 
for different traffic engineering algorithms 

3 Algorithms

The easiest problems to solve are MPTE-RO and MPTE-TO. Since these are linear
programming problems (they do not contain integrality constraints), they can be solved to
optimality in a straightforward way using a simplex algorithm [7]. The number of
constraints and the total number of variables are given in Table 1. Because the number
of paths between every (ingress, egress) pair increases superexponentially with the size
of the network, we restrict the number of eligible paths for each trunk. We calculate
these paths using the k-shortest path algorithm described in [8].
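
Restricting each trunk to a small set of eligible paths can be sketched as follows. This is not the algorithm of [8]: it simply enumerates all loop-free paths by depth-first search and keeps the k with the fewest hops, which is only workable on small graphs. The function name and the example graph are assumptions.

```python
def k_shortest_simple_paths(adj, source, sink, k):
    """Return up to k loop-free paths from source to sink, fewest hops first."""
    found = []
    stack = [[source]]
    while stack:
        path = stack.pop()
        node = path[-1]
        if node == sink:
            found.append(path)
            continue
        for nxt in adj.get(node, ()):
            if nxt not in path:  # keep the path loop-free
                stack.append(path + [nxt])
    # Order by hop count, breaking ties lexicographically for determinism.
    found.sort(key=lambda p: (len(p), p))
    return found[:k]


# Hypothetical directed graph given as adjacency lists.
adj = {"a": ["b", "c"], "b": ["d"], "c": ["b", "d"], "d": []}
eligible = k_shortest_simple_paths(adj, "a", "d", k=2)
```

The returned list plays the role of the restricted path set of one trunk; only these paths receive routing variables in the model.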

SPTE-RO and SPTE-TO are mixed integer linear programming problems. We solve
these problems using a branch-and-bound algorithm [7], which is a tree search algorithm.
When the number of integer variables increases, it is important to control the
size of the search tree and the way in which the search tree is traversed, so that the
first feasible solution is found within a reasonable time. We found that a combination
of adapting the branching strategy, adapting the variable selection and limiting
the total optimisation time leads to the best results.

Since only one of the variables $u_{kp}$ can be nonzero for each traffic trunk, exploring
the up branch first ensures that branches in the search tree are cut off more quickly, so
that more feasible solutions can be investigated. By adopting this branching strategy, we
find feasible solutions earlier in the tree search.

It is intuitively clear that although non-shortest paths may be needed, only shortest 
paths allow a minimisation of the network resource usage. This means that the optimal 
solution will usually consist of a large number of short(est) paths and some longer 
paths. Therefore, we attach a branching priority that is inversely proportional to the 
hopcount of each path to the corresponding routing and flow variables. By influencing 
the variable selection rule in this way, we find better solutions earlier in the tree 
search. 
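
The variable selection rule described above can be sketched as a priority table. The snippet assumes that every candidate path has at least one hop and uses an arbitrary scale factor; how such priorities are handed to a particular solver is not shown and is not taken from the paper.

```python
def branching_priorities(candidates, scale=1000):
    """Map (trunk, path index) to a priority inversely proportional to hop count."""
    priorities = {}
    for trunk, paths in candidates.items():
        for index, path in enumerate(paths):
            hops = len(path) - 1  # assumes every path has at least one hop
            priorities[(trunk, index)] = scale / hops
    return priorities


# Hypothetical trunk with a one-hop and a two-hop candidate path.
candidates = {"k1": [["a", "b"], ["a", "c", "b"]]}
prio = branching_priorities(candidates)
```

Variables belonging to short paths get the highest priority, so the tree search branches on them first.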

For SPTE-RO and SPTE-TO, the number of constraints, the total number of variables
and the number of integer variables are given in Table 1. The use of the problem-specific
heuristics we proposed in this section is even more important for SPTE-TO than
for SPTE-RO, because the size of SPTE-TO increases faster with the size of the network
and the number of trunks.



4 Results 

We implemented the algorithms on Sun Solaris using C++. We used the 
CPLEX 6.5 [9] simplex and branch-and-bound algorithms to solve the linear and 
mixed integer linear programs. The time limit for the optimisation was fixed at 300 
seconds for all runs of the algorithm. When evaluating traffic engineering algorithms, 
care must be taken as far as the simulation conditions are concerned: the topology of 
the network, the number of traffic trunks and the type of demand for each trunk are of 
importance, and each has its influence on the final result. 

For each of the algorithms, we considered 50 topologies that were generated ran- 
domly using the algorithm proposed by Zegura et al. [10]. Each topology consisted of 
25 nodes. The number of links varies around 70. We also considered an existing
network, the US backbone network of MCI Worldcom [11]. Its topology consists of
41 nodes and 112 links. It is considerably denser than the Zegura topologies and
contains two node- and edge-disjoint paths for every border node pair. In all cases, the
capacity of all the links was set to 10.

Two types of trunk generation were used. For the randomly generated Zegura
topologies, 10 nodes were randomly chosen as border nodes. Then, a trunk with unit
demand was set up between each pair of border nodes and in each direction. This
resulted in a total of 90 trunks. For the MCI topology, we considered each of the 41
nodes as a border node, and set up 90 trunks between randomly chosen source and
sink nodes, again with unit demand. In this case, the demands need not be symmetrical.

Table 2. Average, best-case and worst-case fairness and balance increase compared to OSPF
for 50 random topologies with 25 nodes and 90 traffic trunks

                          SPTE-RO       SPTE-TO       MPTE-TO
worst-case   fairness     (0 %)         0 %           0 %
             balance      0 %           (0 %)         (0 %)
average      fairness     (21.9 %)      21.9 %        23.2 %
             balance      16.0 %        (16.0 %)      (16.6 %)
best-case    fairness     (111.1 %)     111.1 %       111.1 %
             balance      55.56 %       (55.56 %)     (55.56 %)

Table 2 summarises the results obtained for the Zegura topologies. The average,
best-case and worst-case values of the performance parameters we identified (fairness,
throughput, balance and capacity usage) are listed for the population of 50 Zegura
topologies. First we considered the network load as nominal (EF) traffic, and solved
SPTE-RO to obtain a TE solution. Then we considered the network load as excess (BE)
traffic, and solved SPTE-TO and MPTE-TO to obtain a TE solution. Once a TE solution
has been fixed, the values of all the performance parameters we defined can be
calculated, even when some of them were not explicitly considered by the algorithm used
for obtaining the TE solution (e.g. SPTE-RO does not consider fairness and throughput).

The results are expressed as increase percentages compared to a non traffic- 
engineered solution, that is, the reference we take is the routing that OSPF would en- 
force. The same results are shown in Table 3 for the MCI WorldCom topology. 



4.1 Traffic-Engineered versus Non-traffic-engineered Solution 

Table 2 clearly shows the beneficial influence that traffic engineering can have on the
ability of a network to handle excess (BE) traffic. On average, there is a fairness
increase of 21.9% for SPTE-TO and of 23.2% for MPTE-TO. There is also a considerable
increase in throughput (17.9% for SPTE-TO and 31% for MPTE-TO). The two
extreme cases (best case and worst case) that are shown in the table indicate that the
traffic-engineered solutions are always at least as good as OSPF. The best-case solution
shows an enormous increase in fairness, combined with a large throughput increase for
both algorithms with traffic-oriented objectives.

Table 2 also shows the beneficial influence that traffic engineering has on load
balancing of nominal (EF) traffic. On average, the balance that SPTE-RO gives is 16%
better than that of OSPF, at a cost of an increase of 2.6% in capacity usage (which is
natural, since SPTE-RO moves traffic away from shortest paths in order to improve the
balance).

Table 3. Fairness and balance increase compared to OSPF for the MCI WorldCom topology
with 90 traffic trunks

             SPTE-RO       SPTE-TO       MPTE-TO
fairness     (10.0 %)      10.0 %        10.0 %
balance      10.0 %        (10.0 %)      (10.0 %)

Table 3, which summarises the results for the MCI WorldCom network, agrees well 
with the results obtained for the Zegura topologies. 



4.2 Resource-Oriented versus Traffic-Oriented Objectives 

Table 2 shows that the resource-oriented and traffic-oriented traffic engineering
algorithms show the same performance as far as fairness and balance are concerned. This
is natural, since there is a one-to-one relationship between the optimal balance
calculated by SPTE-RO and the optimal fairness calculated by SPTE-TO.

The difference between the route calculations of SPTE-RO and SPTE-TO reveals itself
in the capacity usage and the throughput. We checked what would happen if we
calculated routes for EF traffic using SPTE-TO, that is, considering the nominal
bit rate requirements as if they were excess bit rate requirements. From Table 4 it is
clear that the routes calculated by SPTE-RO lead to a significantly lower capacity usage
(on average about 18% lower).

On the other hand, Table 5 shows that SPTE-TO is significantly better for maximising
the throughput of the network, that is, it is not a good idea to calculate routes for
excess traffic trunks in the same way as for premium service traffic trunks. The
throughput that can be obtained using SPTE-TO is roughly 22% higher than that of
SPTE-RO. Generally speaking, when used for routing excess traffic, SPTE-RO increases
the fairness but not the throughput of the network, compared to OSPF.



Table 4. Capacity usage increase compared to OSPF for 50 random topologies with 25
nodes and 90 traffic trunks

               SPTE-RO       SPTE-TO
worst-case     17.95 %       55.56 %
average        6.82 %        24.80 %
best-case      0 %           10.64 %

Table 5. Throughput increase compared to OSPF for 50 random topologies with 25
nodes and 90 traffic trunks

               SPTE-RO       SPTE-TO
worst-case     -25.4 %       -16.6 %
average        -4.2 %        17.9 %
best-case      17.4 %        73.5 %



4.3 Multi-path versus Single-Path Solution

We limit the scope of this comparison to SPTE-TO and MPTE-TO. Table 2 shows that a
solution allowing multiple paths for each traffic trunk can considerably improve the
potential throughput of the network. It is therefore worthwhile to investigate the cost
of this additional flexibility, that is, how many more LSPs have to be configured in
order to attain these increases in throughput and fairness. We noticed that the average
number of non-shortest paths for the multi-path solution is almost the same as for the
single-path solution. An investigation of each individual solution unveiled two reasons
for this. First, there is only a very limited number of traffic trunks in the multi-path
solution for which traffic is carried on more than one path. Second, if more than one
path is used, the shortest path is almost always among them, whereas for the single-path
solution, the one path that is chosen is often not the shortest path. Although the
multi-path solution increases the total number of paths, this increase is generally
limited. On average the number of used paths increased by 11%.

The above conclusions were drawn for a fixed number of traffic trunks. It can be
expected, however, that the difference between a multi-path and a single-path solution
changes as a function of the ratio between the number of traffic trunks and the network
size. To verify this we calculated the evolution of the fairness and throughput, for a
single network, as a function of the number of trunks. The network we used is the MCI
WorldCom US backbone. For this network, 200 traffic trunks were generated randomly.
We start the simulation by randomly selecting 20 of these traffic trunks, and
solve MPTE-TO, SPTE-TO and OSPF. This process is repeated 25 times and for each
iteration, a different selection of traffic trunks is made. Then, the number of traffic
trunks is increased to 40. This procedure is repeated until the number of traffic trunks
reaches 200. For each number of trunks, the fairness and throughput increase compared
to the OSPF solution is averaged over all 25 runs of the algorithms. The resulting
fairness and throughput increase plots are shown in Fig. 1.

From Fig. 1, two important conclusions can be drawn. First, the benefit of traffic
engineering exhibits a decreasing trend as a function of the number of trunks in the
network. The gain seems to saturate at a value around 10% for both fairness and
throughput. This suggests that an important benefit can be expected from traffic
engineering for larger problems as well (we have no explanation for the oscillating
behavior of the fairness gain curves; the data we have indicate that it is due to the
behavior of the OSPF solution). Second, the difference between MPTE-TO and SPTE-TO
decreases with the number of trunks in the network. This can be explained from the
observation that an increase in the number of traffic trunks will increase the average
load on all links in the network. Therefore, the fairness or throughput gain obtained by
using (additional) longer alternative paths in MPTE-TO may actually be offset by the
increase of network resource utilization, which may be detrimental to other traffic
trunks. The number of bifurcated trunks will thus converge to zero, which is equivalent
to the convergence of MPTE-TO to SPTE-TO. This effect can be interpreted in two ways. One
could say that for a large number of traffic trunks, only SPTE-TO has to be considered,
because it guarantees close to optimal fairness and throughput, and minimises the
number of LSPs that have to be set up. On the other hand, one could also say that solving
MPTE-TO and eliminating all but the best used path for each trunk yields a solution
that is close to optimal, while considerably improving the speed of the algorithm.

Fig. 1. Average fairness and throughput increase of SPTE-TO and MPTE-TO as a function
of the number of trunks in the problem

The fairness and throughput obtained by MPTE-TO are the best that can be achieved
with any TE algorithm. Therefore, a measure $O_X$ (where $X$ can be either $F$ or $T$) for
the optimality of the single path solution is given by

$$O_X = 100 \cdot \frac{X_{\text{SPTE-TO}} - X_{\text{OSPF}}}{X_{\text{MPTE-TO}} - X_{\text{OSPF}}}$$

which is the percentage of the fairness and throughput gain obtained by MPTE-TO that
can be obtained, on average, by SPTE-TO. If we apply this formula to the results of
Figure 1, SPTE-TO obtains between 80% and 90% optimality for fairness and around
70% optimality for throughput.
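
The optimality measure is a simple ratio of gains over the OSPF baseline. The sketch below evaluates it for hypothetical throughput values; the numbers are illustrative and are not taken from Figure 1.

```python
def optimality(x_spte, x_mpte, x_ospf):
    """Percentage of the multi-path gain over OSPF that the single-path solution attains."""
    gain_mpte = x_mpte - x_ospf
    if gain_mpte == 0:
        raise ValueError("multi-path solution shows no gain over OSPF")
    return 100.0 * (x_spte - x_ospf) / gain_mpte


# Hypothetical throughputs: OSPF 100, single path 118, multiple paths 120.
o_t = optimality(x_spte=118.0, x_mpte=120.0, x_ospf=100.0)
```

The measure is undefined when the multi-path solution does not improve on OSPF at all, which the sketch reports as an error rather than returning an arbitrary value.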



5 Conclusion 

In this paper, we gave a mathematical programming formulation of four traffic engi- 
neering problems. Algorithms were presented for solving them. The choice of algo- 
rithm was particularly important when solving mixed integer linear programs, because 
of the rapid increase of their solution time with the size of the problem. 

The results obtained from these algorithms were compared by means of statistical 
simulations on randomly generated topologies and by means of a case study on an 
existing US network backbone topology. We showed that traffic engineering contributes
significantly to the traffic handling capacity of the network. Finally, we showed
that a single-path solution offers results comparable to those of a multi-path solution,
especially for a large number of traffic trunks.

Abstract. Multi-Protocol Label Switching (MPLS) is currently heavily
used inside autonomous systems for traffic engineering and VPN purposes.
We study the cost of using MPLS to carry interdomain traffic by
analyzing two one-day traces from two different ISPs. Our study shows
that a hybrid MPLS+IP solution can significantly reduce the number
of LSPs and signalling operations by using MPLS for high bandwidth
flows and pure IP for low bandwidth flows. However, the burstiness of
the interdomain LSPs could be a problem.



1 Introduction 

One of the basic assumptions of IP networks such as the Internet is that all 
IP packets are individually routed through the network based on the addresses 
contained in the packet header. This assumption is still valid today, but the com- 
plexity of per-packet routing coupled with the need to sustain the exponential 
growth of the Internet in terms of capacity and number of attached networks 
has led researchers to propose alternative solutions where per-packet routing 
on each intermediate hop is not always required. The first alternative solutions 
such as IP switching [NML98] and others tried to reduce the complexity of the 
packet forwarding operation by leveraging on the available ATM switches and 
establishing short-cut virtual circuits for IP flows. Later on, the IETF decided 
to standardize one IP switching solution under the name Multi-Protocol Label 
Switching (MPLS). 

Although IP switching was initially proposed as a solution to increase the 
performance of core routers by using label-swapping instead of traditional IP 
routing in the core, this is not its only benefit. The main benefit of MPLS today 
is that it allows a complete decoupling between the routing and forwarding 
functions. With traditional IP routing, each router has to individually route and 
forward each packet. A consequence of this is that a packet usually follows the 
shortest path inside a single domain. With MPLS, IP packets are carried inside 
Label Switched Paths (LSPs). These LSPs are routed at LSP establishment 
time and the core routers forward the packets carried by the LSPs only based 
on their labels. This decoupling between forwarding and routing allows MPLS 
to efficiently support traffic engineering inside autonomous systems as well as 
transparent VPN services. 

In this paper, we evaluate the cost of extending the utilization of MPLS across
interdomain boundaries instead of restricting MPLS inside domains as is usually 
done today. The remainder of this paper is organized as follows. In section 2 
we summarize the advantages of using MPLS across interdomain boundaries. To 
evaluate the cost of using MPLS in this environment, we collected traces from two 
different ISPs as described in section 3. We then analyze these traces in section 4 
and show that a pure MPLS solution to carry all interdomain traffic would be 
too costly from a signalling point of view. We then analyze hybrid solutions 
where a fraction of the interdomain traffic is handled by MPLS while another 
fraction is handled by traditional IP hop-by-hop routing in sections 5 and 6.



2 Using MPLS at Interdomain Boundaries 

Today, MPLS is considered as a key tool to be used inside (large) autonomous 
systems. This utilization of MPLS has been supported by a lot of research and 
development during the last few years. In contrast, the utilization of MPLS for 
interdomain traffic has not been studied in detail. BGP has been modified to
distribute MPLS labels and the RSVP-TE and CR-LDP signalling protocols
support the establishment of interdomain LSPs. However, to our knowledge, MPLS
has not yet been used to carry operational interdomain traffic.

In today’s Internet, the behavior of interdomain traffic is mainly driven by 
several underlying assumptions of the BGP routing protocol. The first assump- 
tion is that once a border router announces an address prefix to a peer, this 
implies that this prefix is reachable through the border router. This reachability 
information implies that the border router is ready to accept any rate of IP 
packets towards the announced prefix. With BGP, the only way for a router to 
limit the amount of traffic towards a particular prefix is to avoid announcing this 
prefix to its peers. A second assumption of BGP is that all traffic is best-effort. 
This assumption was valid when BGP was designed, but will not remain valid 
in the near future with the deployment of applications such as Voice over IP or 
multimedia and streaming applications and the increasing needs to provide some 
QoS guarantees for “better than best-effort” traffic (e.g. Intranet, Extranet or 
traffic subject to specific service level agreements). 

The utilization of MPLS for interdomain traffic could provide two advan- 
tages compared with traditional hop-by-hop IP routing. First, the utilization of 
MPLS will allow a complete decoupling between the routing and the forwarding 
functions. A border router could use a modified¹ version of BGP to announce
an MPLS-reachability for external prefixes. This MPLS-reachability means that
a peer could send traffic towards these announced prefixes provided that an LSP
is first established to carry this traffic. At LSP establishment time, the border
router will use connection admission control to decide whether the new LSP can
be accepted inside its domain or not.

¹ The extensions to BGP required to support MPLS-reachability at interdomain
boundaries are outside the scope of this paper. In this paper, we simply assume that
some mechanism exists to announce routes reachable through MPLS.

A second advantage is that it would be easy to associate QoS guarantees to
interdomain LSPs. These guarantees would be a first step towards the extension 
of traffic engineering across interdomain boundaries and could also be a way of 
providing end-to-end QoS by using guaranteed LSPs as a kind of virtual leased 
lines across domains. 

3 Measurement Environment 

To better characterize the flows that cross interdomain boundaries, we considered 
traffic traces from two different ISPs. By basing our analysis on two different 
networks, we reduce the possibility of measurement biases that could have been 
caused by a particular network. The two ISPs had different types of customers 
and were both multi-homed. 



3.1 The Studied ISPs 

The first ISP, WIN (http://www.win.be), was at the time of our measurements
a new ISP offering mainly dialup access to home users in the southern part of
Belgium. We call this ISP the “dialup” ISP in the remainder of this paper. When
we performed our measurements, the dialup ISP was connected through E1 links
to two different transit ISPs and at the Belgian national interconnection point,
having peering agreements with about ten ISPs there.

The second ISP, Belnet (http://www.belnet.be), provides access to the
commodity Internet as well as access to high speed European research networks 
to universities, government and research institutions in Belgium. We call this ISP 
the “research” ISP in the remainder of this paper. Its national network is based 
on a 34 Mbps backbone linking major Belgian universities. The research ISP 
differs from the dialup ISP in several aspects. First, the “customers” of the re- 
search ISP are mainly researchers or students with direct high speed connections 
to the 34 Mbps backbone, although some institutions also provide dialup service 
to their users. Second, the research ISP is connected to a few tens of external 
networks with high bandwidth links. It maintains high bandwidth peerings with
two transit ISPs and the Dutch SURFNET network, and is part of the TEN-155
European research network. In addition, the research ISP is present with high
bandwidth links at the Belgian and Dutch national interconnection points with 
a total of about 40 peering agreements in operation. 

3.2 Collection of Traffic Traces 

To gather interdomain traffic traces, we relied on the Netflow [Cis99] measure- 
ment tool supported on the border routers of the two ISPs. Netflow provides 
a record at the layer-4 flow level. For a TCP connection, Netflow will record 
the timestamp of the connection establishment and connection release packets 
as well as the amount of traffic transmitted during the connection. For UDP 
flows, Netflow records the timestamp of the first UDP packet for a given flow,



144 Steve Uhlig and Olivier Bonaventure 



the amount of traffic and relies on a timeout for the ending time of a UDP flow. 
The Netflow traces are less precise than the packet level traces used in many 
measurement papers [TMW97,NML98,FRC98,NEKN99] since Netflow does not 
provide information about the arrival time and the size of individual packets 
inside a layer 4 flow. However, Netflow allows us to gather day long traces cor- 
responding to several physical links. Such a traffic capture would be difficult 
with per-link packet capture tools. The Netflow traces were collected at the 
border routers of the ISPs for unicast traffic and stored with a one-minute gran- 
ularity. We recorded Netflow traces for the incoming traffic of the dialup ISP 
and incoming and outgoing traffic for the research ISP. Multicast traffic is not 
included in the traces we consider in this paper. The trace of the dialup ISP was 
collected in September 1999 while the trace for the research ISP was collected 
in December 1999. 

The utilization of Netflow forces us to approximate the layer-4 flows as equivalent
to fluid flows. More precisely, a flow transmitting $M$ bytes between $T_{start}$
and $T_{stop}$ is modeled as a fluid flow transmitting $M/(T_{stop} - T_{start})$ bytes
every second between $T_{start}$ and $T_{stop}$. This approximation obviously leads to an
incorrect estimation of the burstiness of the traffic and it can be expected that
the utilization of Netflow underestimates the burstiness of interdomain flows.
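
The fluid approximation can be sketched as follows: each flow record contributes a constant rate to every second of its lifetime. The record layout (start, stop, bytes) with integer timestamps, the function name and the handling of zero-length flows are assumptions made for this illustration and do not describe the Netflow export format.

```python
def fluid_rate_per_second(flows, horizon):
    """Aggregate flow records into a bytes-per-second series of length horizon.

    flows: iterable of (t_start, t_stop, n_bytes) with integer second timestamps.
    """
    rate = [0.0] * horizon
    for t_start, t_stop, n_bytes in flows:
        duration = t_stop - t_start
        if duration <= 0:
            continue  # zero-length flows would need a separate convention
        per_second = n_bytes / duration
        for t in range(max(t_start, 0), min(t_stop, horizon)):
            rate[t] += per_second
    return rate


# Hypothetical records: 400 bytes over seconds 0..3 and 100 bytes over seconds 2..3.
flows = [(0, 4, 400), (2, 4, 100)]
series = fluid_rate_per_second(flows, horizon=5)
```

Because every flow is spread evenly over its duration, any burst inside a flow is flattened, which is why this model tends to underestimate burstiness.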

3.3 Daily Traffic Evolution 

The first noticeable difference between the two ISPs is the total amount of traffic 
carried by each ISP. The total amount of daily incoming traffic for the dialup ISP 
is about 37 GBytes. The research ISP received 390 GBytes during the studied 
day and transmitted 158 GBytes during the same period. The research ISP
thus receives ten times more traffic than the dialup ISP. A closer look at the
traffic of the research ISP shows that this traffic is mainly driven by TCP. For
this ISP, 97.5 % of the incoming traffic in volume was composed of TCP packets.
For the outgoing traffic, 95.8 % of the total volume was composed of TCP traffic.
This prevalence of TCP is similar to the findings of previous studies [TMW97].
This implies that UDP-based multimedia applications do not seem to be yet an 
important source of unicast traffic, even in a high bandwidth network such as 
the research ISP. 

A second difference between the two ISPs is the daily evolution of the in- 
terdomain traffic. Figure 1 (left) shows that for the dialup ISP the peak hours 
are mainly during the evening while for the research ISP the peak hours are 
clearly the working hours (figure 1 (right)). For the dialup ISP, the links to the 
transit ISPs are congested during peak hours. For the research ISP, the links to 
the two transit ISPs are congested during peak hours, but not the links towards 
the interconnection points and the research networks with which the research 
ISP peers. For the research ISP, there is a clear asymmetry between the incom- 
ing and the outgoing traffic. The amount of incoming traffic is more than four 
times higher than the amount of outgoing traffic during peak hours. Outside 
peak hours, the amounts of incoming and outgoing traffic for the research ISP 
are similar. A similar asymmetry exists for the dialup ISP. 
Fig. 1. Daily traffic evolution for dialup (left) and research (right) ISP



In the remainder of this paper, we focus our analysis on the incoming traffic 
since this is the predominant traffic for both ISPs. 

4 Cost of a Pure MPLS Solution 

Replacing traditional IP routing at border routers by MPLS switching would 
clearly have several implications on the performance of the border MPLS
switches. MPLS could in theory be used with a topology driven or a traffic
driven LSP establishment technique. By considering one LSP per network prefix, 
a topology driven solution would require each autonomous system to maintain 
one LSP towards each of the about 70000 prefixes announced on the Internet. 
Such a pure topology-driven LSP establishment technique would imply the cre- 
ation and the maintenance of 70000 LSPs by each autonomous system. This 
number of LSPs is clearly excessive. 

To reduce the number of interdomain LSPs, we evaluate in this paper the 
possibility of using traffic-driven LSPs, i.e. LSPs that are dynamically estab- 
lished when there is some traffic towards some prefixes and released during idle 
periods. More precisely, we consider the very simple LSP establishment tech- 
nique described in figure 2. In this section, we assume that trigger is equal to 1 
byte, i.e. all IP traffic is switched. 

To evaluate the cost of using MPLS for interdomain traffic, we have to con- 
sider not only the number of established LSPs, but also the number of signalling 
operations (i.e. LSP establishment and release). When considering interdomain 
traffic, the cost of using MPLS is not simply the CPU processing cost of the 
signalling messages by each intermediate router. We do not expect that this
cost would be the bottleneck. When an interdomain LSP is established, it will
typically pass through several autonomous systems. When a border router
receives an LSP establishment request, it will have to verify whether the LSP can
be accepted given the network policies, the current utilization of autonomous 
system links as well as authentication, billing and accounting issues. The han- 
dling of all these issues might be expensive. 
After each one minute period and for each prefix p:

// Traffic(p) is traffic from prefix p during last minute
if (Traffic(p) > trigger)
{
    if (LSP(p) was established)
        LSP(p) is maintained;
    else
        Establish LSP(p);
}
else
{
    if (LSP(p) was established)
        LSP(p) is released;
}

Fig. 2. Simple LSP establishment technique 
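
The procedure of figure 2 can be sketched in a few lines of Python. This is only an illustrative sketch: the function name update_lsps, the use of a set to hold the established LSPs and the per-minute dictionary of byte counts are assumptions of this example, and the strict comparison with the trigger follows the figure literally.

```python
def update_lsps(established, traffic, trigger=1):
    """Apply one per-minute step of the procedure of figure 2.

    established: set of prefixes that currently have an LSP (updated in place).
    traffic: {prefix: bytes received from that prefix during the last minute}.
    Returns (setups, releases), the signalling operations of this minute.
    """
    setups = releases = 0
    # Consider prefixes that sent traffic as well as those holding an LSP.
    for p in set(established) | set(traffic):
        if traffic.get(p, 0) > trigger:
            if p not in established:
                established.add(p)      # Establish LSP(p)
                setups += 1
            # otherwise LSP(p) is simply maintained
        elif p in established:
            established.remove(p)       # LSP(p) is released
            releases += 1
    return setups, releases
```

Calling this function once per minute of the trace gives both quantities studied in this section: the size of the set is the number of active LSPs, and the returned pair gives the number of signalling operations.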





Fig. 3. Number of active LSPs for dialup (left) and research (right) ISP 



Figure 3 compares the total number of LSPs that the border router of the 
ISP needs to maintain for the dialup (left) and the research ISP (right). In this 
figure, we only consider the incoming traffic as mentioned previously. This figure 
shows two interesting results. First, as expected, the number of LSPs follows 
the daily evolution of the traffic. Both ISPs need to maintain a larger number 
of LSPs during peak hours than during the night. Second, the research ISP 
with about ten times more traffic than the dialup ISP needs to maintain about 
ten times more interdomain LSPs than the dialup ISP. This means that with 
more capacity the research ISP communicates with a larger number of network 
prefixes than the dialup ISP rather than receiving more traffic from the same 
number of network prefixes. While the absolute number of LSPs stays in the 
range of 1000-2000 for the dialup ISP, the research ISP would require more than 
10000 simultaneous LSPs during peak hours in order to switch every packet on 
an LSP. This number, given the cost of establishing interdomain LSPs, might be
too high. 
Fig. 4. Signalling overhead for dialup (left) and research (right) ISP



The second performance factor to be considered is the number of signalling 
operations. Figure 4 shows the number of per-minute signalling operations (LSP 
establishment and LSP release) for the dialup (left) and the research (right) ISP. 
For the dialup ISP, several hundreds of LSPs need to be established or released 
every minute. This is a large number compared to the 1000-2000 LSPs
that are maintained by this ISP. For the research ISP, about 1000 signalling 
operations need to be performed every minute. This implies that during peak 
hours, 10 % of the LSPs are modified during each one minute interval. For 
both ISPs, the number of signalling operations would probably preclude the 
deployment of a pure MPLS solution to carry all interdomain traffic. 

5 Reducing the Number of LSPs 

The previous section has shown the high cost of a pure MPLS solution to han- 
dle interdomain traffic. A pure MPLS solution is probably too costly from the 
signalling point of view, even for the relatively small ISPs that we considered in 
this study. To allow the efficient utilization of MPLS for interdomain traffic, we 
clearly need to reduce the number of interdomain LSPs as well as the number 
of signalling operations. This could be done in two different ways. 

A first way would be to aggregate traffic from several network prefixes inside 
a single LSP. This would require a close cooperation with the routing protocol to 
determine which network prefixes can be aggregated inside each LSP. A potential 
solution was proposed in [PHS00]. Space limitations prevent us from discussing this
issue further in this paper.

A second way would be to utilize MPLS for high bandwidth flows and normal 
IP routing for low bandwidth flows. This would make it possible to benefit from MPLS
capabilities to handle the higher bandwidth flows while avoiding the cost of 
LSPs for low bandwidth flows. This could be coupled with different types of 
routing for the two types of flows as proposed in [SRS99] where a different type 
of routing was developed for long-lived flows. 

To evaluate whether the interdomain traffic of our two ISPs could be sepa- 
rated in two such classes, we analyzed the total amount of traffic received from 
each network prefix during the studied day. Figure 5 (left) shows that during this
day, the dialup ISP received IP packets from slightly more than 15000 different
network prefixes (out of about 70000 announced prefixes on the global Internet). 
However, it also shows that these prefixes are not equal. The most important 
prefix sends about 3.9% of the total traffic that enters the dialup ISP. The total 
traffic from the top 100 prefixes seen by the dialup ISP corresponds to 50% of 
the daily incoming traffic and 560 (resp. 1800) prefixes are required to capture 
80% (resp. 95 %) of the daily incoming traffic. 



[Figure 5 panels: "Dialup ISP: Incoming traffic" (left) and "Research ISP: Incoming traffic" (right); horizontal axis: Address prefix/Autonomous System]

Fig. 5. Per prefix daily traffic distribution for dialup (left) and research (right) ISP



A similar trend exists for the research ISP as shown in figure 5 (right). The
research ISP received IP packets from 18000 different network prefixes during
the studied day. For this ISP, the most important prefix sends 3.5% of the daily 
traffic. The total traffic from the top 100 prefixes seen by this ISP corresponds 
to 49.5 % of the daily traffic. Furthermore, the top 500 (resp. 1820) prefixes 
transmit 80 % (resp. 95 %) of the daily traffic towards the research ISP. 

Based on this analysis, it seems possible to capture a large portion of the 
traffic by only considering the prefixes that transmit a large amount of data or 
the high bandwidth flows. The separation of the traffic into two different classes 
must be performed online at the border routers. For this, very simple techniques 
are required. 
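
The figures quoted above (number of prefixes needed to capture 50 %, 80 % or 95 % of the daily traffic) come from a simple ranking of the prefixes by volume. A minimal Python sketch of this computation is given below; the function name and the dictionary of per-prefix daily byte counts are hypothetical and only serve the illustration.

```python
def prefixes_needed(bytes_per_prefix, share):
    """Smallest number of top prefixes whose traffic reaches `share` of the total.

    bytes_per_prefix: {prefix: bytes received from that prefix during the day}.
    share: target fraction of the total traffic, between 0 and 1.
    """
    total = sum(bytes_per_prefix.values())
    running = 0
    ranked = sorted(bytes_per_prefix.values(), reverse=True)
    for n, volume in enumerate(ranked, start=1):
        running += volume
        if running >= share * total:
            return n
    return len(bytes_per_prefix)
```

Applied to the daily trace of an ISP, prefixes_needed(volumes, 0.5) would give the number of prefixes that carry half of the incoming traffic.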

6 Cost of a Hybrid MPLS+IP Solution 

As a simple mechanism to segregate the traffic between high bandwidth and low 
bandwidth flows, we consider the procedure described in figure 2 with a large 
trigger. This means that an LSP is maintained if we saw at least trigger bytes
for a given prefix during the last minute. An LSP is released if we saw less than
trigger bytes for a given prefix during the last minute. We assume that an LSP
can be instantaneously established and thus seeing more than trigger bytes for a
minute suffices to consider an LSP (dedicated to that prefix) as active (established
or maintained) for the whole minute. This scheme is very simple, since only the
last minute is considered in order to decide the action to perform on the LSP. 
Other LSP establishment schemes are not considered in this paper due to space 
limitations. 

By using this trigger-based mechanism, we use MPLS LSPs for high band- 
width flows and normal IP routing for low bandwidth flows. Figure 6 shows 
for both ISPs the amount of traffic captured by the high bandwidth flows as 
a function of the trigger and the number of signalling operations that are re- 
quired to establish and release these high bandwidth LSPs. The captured traffic 
is expressed as a percentage of the total daily traffic of the ISP. The number 
of signalling operations is expressed as a percentage of the required signalling 
operations when all the traffic is switched by MPLS (i.e. trigger = 1 byte).



[Figure 6 panels: "Dialup ISP: incoming traffic" (left) and "Research ISP: incoming traffic" (right)]

Fig. 6. Impact of LSP trigger for dialup (left) and research (right) ISP



Figure 6 (left) shows that for the dialup ISP if we only use MPLS for the 
prefixes that transmit at least 10 KBytes per minute, we still capture 97% of 
the total traffic while we reduce the number of signalling operations by a factor 
of 3. If we use MPLS for the prefixes that transmit at least 1 MByte per
minute, we only capture 10 % of the daily traffic. A similar situation holds for
the research ISP. In this case, figure 6 (right) shows that if we use MPLS for
prefixes that transmit at least 1 MByte per minute, then we still capture 58 %
of the daily traffic and we reduce the number of signalling operations by a factor
of 25 compared to a pure MPLS solution. Figure 6 clearly shows that using
MPLS only for high bandwidth flows reduces the number of signalling
operations while still maintaining a good capture ratio.
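
The two curves of figure 6 can be obtained by replaying the trace for each trigger value. The Python sketch below shows one possible way to do so, assuming that the trace is available as a list with one {prefix: bytes} dictionary per minute (a hypothetical layout), that an LSP is active when at least trigger bytes were seen during the minute, and that LSPs still open at the end of the trace are not counted as released.

```python
def evaluate_trigger(trace, trigger):
    """Replay a trace and return (bytes carried on LSPs, signalling operations).

    trace: list of {prefix: bytes}, one dictionary per one-minute period.
    An LSP is considered active for a whole minute when at least
    `trigger` bytes were received from its prefix during that minute.
    """
    active = set()
    captured = 0
    operations = 0
    for minute in trace:
        now = {p for p, nbytes in minute.items() if nbytes >= trigger}
        # Prefixes only in `now` are setups, prefixes only in `active` are releases.
        operations += len(now ^ active)
        captured += sum(minute[p] for p in now)
        active = now
    return captured, operations
```

Dividing both results by those obtained with trigger=1 gives the percentage of captured traffic and the relative number of signalling operations for a given trigger.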

Based on figure 6, a trigger between 10 and 100 KBytes (resp. 10 KBytes) for 
the research ISP (resp. dialup ISP) would be a reasonable compromise between 
the amount of traffic captured and the number of signalling operations. However, 
the number of signalling operations and the percentage of the captured traffic 
are not the only performance factors that we need to take into account. Another 
performance factor is the lifetime of the interdomain LSPs. Ideally, such an LSP
should last for a long period of time so that the cost of establishing this LSP can
be amortized over a long period. If the LSPs only last for a few minutes, then it
would be difficult to dynamically establish high bandwidth interdomain LSPs. 



[Figure 7 panel: "Dialup ISP: incoming traffic"; horizontal axis: LSP duration]

Fig. 7. LSP lifetime for dialup ISP



To evaluate the duration of these LSPs, we plot in figures 7 and 8 the cumu- 
lative percentage of the traffic that is carried by LSPs lasting at least x minutes. 
Figure 7 considers the cumulative amount of traffic that is carried by the LSPs 
as a function of their lifetime for the dialup ISP. This figure shows that if we 
consider a pure MPLS solution (trigger=1 byte), 17.5 % of the total traffic is
carried by LSPs that remain established for up to five minutes. Thus, LSPs 
lasting more than five minutes capture more than 82.5 % of the total traffic. 
Five minutes is probably too short a duration for interdomain LSPs. If we now 
consider the LSPs that last for at least 30 minutes, they capture 47 % of the 
total traffic. However, when we utilize MPLS only for high bandwidth flows, the 
lifetime of the LSPs decreases. For example, if we consider the 100 KBytes LSPs, 
these LSPs only capture 63 % of the total traffic. Within the 100 KBytes LSPs, 
the LSPs that last for at least 30 minutes capture only 15 % of the daily traffic 
of the dialup ISP. 
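
The lifetime analysis can be reproduced with the following Python sketch, which is given under the same assumptions as the rest of this section: the trace is a list with one {prefix: bytes} dictionary per minute, and an LSP is a run of consecutive minutes during which its prefix sends at least trigger bytes. The function names are hypothetical.

```python
def lsp_runs(trace, trigger):
    """Return one (duration_in_minutes, bytes_carried) pair per LSP."""
    open_runs = {}   # prefix -> [duration, bytes] of the LSP currently established
    finished = []
    for minute in trace:
        now = {p for p, nbytes in minute.items() if nbytes >= trigger}
        for p in list(open_runs):
            if p not in now:                 # the LSP is released
                finished.append(tuple(open_runs.pop(p)))
        for p in now:                        # the LSP is established or maintained
            run = open_runs.setdefault(p, [0, 0])
            run[0] += 1
            run[1] += minute[p]
    finished.extend(tuple(run) for run in open_runs.values())
    return finished

def share_lasting_at_least(trace, trigger, x):
    """Fraction of the total traffic carried by LSPs lasting at least x minutes."""
    total = sum(sum(minute.values()) for minute in trace)
    carried = sum(nbytes for duration, nbytes in lsp_runs(trace, trigger) if duration >= x)
    return carried / total if total else 0.0
```

Plotting share_lasting_at_least for increasing values of x and for several triggers yields curves of the kind discussed here.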

Figure 8 shows that the behavior of the research ISP is slightly different. If 
we consider a pure MPLS solution, then 18.3 % of the daily traffic is captured by 
LSPs that last up to five minutes. If we consider the LSPs that remain active for 
at least 30 consecutive minutes, these LSPs capture 65.5 % of the daily traffic. 
These values are better than for the dialup ISP. However, if we now consider 
the high bandwidth LSPs, we see an important decrease in the lifetime of these 
LSPs. If we consider the LSPs that transmit at least 10 KBytes per minute (a 
rather low bandwidth flow for the research ISP), they capture 97.5 % of the 
daily traffic. However, the 10 KBytes LSPs that last for at least 30 minutes only 
capture 38.2 % of the daily traffic. The situation is even worse when we consider 
the LSPs that carry at least 1 MByte per minute. All these LSPs carry 58 % 
of the daily traffic. However, among these high bandwidth LSPs, the LSPs that 
last only a single minute carry 36 % of the daily traffic of the research ISP. The
high bandwidth LSPs that have a duration longer than five minutes carry only
3.6 % of the daily traffic and almost no LSP remains active for
more than 30 minutes.



[Figure 8 panel: "Research ISP: incoming traffic"; horizontal axis: LSP duration]

Fig. 8. LSP lifetime for research ISP



The difference between the evolution of the LSP lifetime with the bandwidth 
of the LSP for the two ISPs can probably be explained by two factors. The 
first factor is the congestion level. Most of the incoming traffic of the dialup 
ISP is received through its two heavily congested transit ISP links. On the other 
hand, the external links of the research ISP, especially those towards the research 
networks and the interconnection points, are only lightly congested. The second 
factor is the maximum bandwidth that a user can consume. A user of the dialup 
ISP is limited by its dialup modem while a user of the research ISP may easily 
receive traffic at several Mbps. 

The burstiness of interdomain traffic of the research ISP implies that it would 
be difficult to utilize guaranteed bandwidth interdomain LSPs to optimize the 
traffic of the research ISP. A closer look at the behavior of these LSPs shows 
that it is difficult to predict the bandwidth that one LSP would need for the 
upcoming minute. For the research ISP, the solutions proposed in [DGG+99] are 
not applicable. Either the reserved bandwidth is much smaller than the traffic 
carried by the LSP or the reservation is much larger than the actual traffic. In 
both cases, the utilization of guaranteed bandwidth interdomain LSPs does not 
seem to be a good solution to perform interdomain traffic engineering for our 
research ISP. This is due to the current nature of the best-effort traffic and the 
large capacity of our research ISP. The situation would probably change with the 
deployment of differentiated services and the utilization of traffic conditioners 
such as shapers. QoS sensitive applications such as multimedia, streaming or 
voice over IP would behave differently from the best-effort applications we found 
in our two ISPs. 
7 Conclusion

In this paper, we have analyzed the cost of using MPLS to carry interdomain 
traffic. Our analysis was carried out by studying full day traffic traces for two 
different ISPs. We have shown that utilizing exclusively MPLS to handle all the
interdomain traffic would be too costly when considering the number of LSPs
and the number of signalling operations that are required to dynamically
establish and release such LSPs.

We have then shown that the cost of MPLS could be significantly reduced by 
using MPLS for high bandwidth flows and traditional hop-by-hop IP routing for 
low bandwidth flows. We have evaluated a simple trigger-based mechanism to 
distinguish between the two types of LSPs. The utilization of such a mechanism 
can significantly reduce the number of signalling operations and the number of 
LSPs. The optimal value for the trigger depends on the total bandwidth of the 
ISP. However, we have also shown that the burstiness of the interdomain LSPs 
could be a significant burden concerning the utilization of MPLS to perform 
interdomain traffic engineering with guaranteed bandwidth LSPs. 
Abstract. In order to provide various services with different quality
requirements, the current Internet is expected to turn into a QoS based
Internet under the Differentiated Service (DiffServ) architecture. A
variety of work has been done in the field of constraint based routing
to provide QoS guaranteed or assured services by developing novel 
routing protocols and algorithms. However, most of these efforts focus 
on intra-domain routing rather than inter-domain routing. In this paper, 
we discuss issues of finding routes with QoS requirements among 
multiple domains, called inter-domain QoS routing. We first investigate 
the needs and problems faced when introducing inter-domain QoS 
routing into the Internet. Then, we present a model for inter-domain 
QoS routing and describe its building blocks. Finally, we present five 
mechanisms for operating inter-domain QoS routing in DiffServ 
networks. 



1 Introduction 

Today's Internet consists of domains, also called Autonomous Systems (ASs). An AS
is usually a set of routers under a single administration, using an intra-domain routing 
protocol and common metrics to route packets within the AS while using an inter- 
domain routing protocol to route packets to other ASs. The overall Internet topology 
may be viewed as an arbitrary interconnection of ASs. With the marvelous success of
the Internet in recent years, the Border Gateway Protocol (BGP) has become the de facto
inter-domain routing protocol in the Internet. BGP shows distinguished flexibility
and robustness in connecting ASs. However, BGP does not provide any exact QoS 
support of traffic flows. 

On the other hand, the current Internet is expected to become a QoS based Internet 
in which various services with QoS requirements will be provided easily. How to 
provide QoS support is becoming one of the hottest topics in the Internet community 
at present. Numerous works have been done in various aspects including traffic 
engineering and network management [1], [2]. Among them, constraint based routing 
is gradually becoming an essential mechanism for selecting routes with requirements
for additional routing metrics, e.g., delay and available bandwidth, or administrative
policies [3], [4]. An objective of constraint based routing is to aid in managing the
traffic and the efficient utilization of network resources by improving the total
network throughput. Moreover, constraint based routing provides flexibility in
support of various services.

Meanwhile, a demand exists for establishing connections with QoS requirements
for specific flows, or even building QoS based networks among multiple ASs. For
example, a large QoS based virtual private network (VPN) for a worldwide company
might be built up in an efficient and economical way through cooperation of several
network operators, each of which might manage a separate AS. Therefore, issues of
constraint based routing over multiple domains naturally arise and call for
solutions. Unfortunately, most previous work on constraint based routing is
limited to the scope of intra-domain routing.

In this paper, we investigate the issues of inter-domain QoS routing. In particular, 
we present five mechanisms for operating inter-domain QoS routing in DiffServ 
networks. These mechanisms can be directly used in DiffServ IP networks with 
several existing inter-domain routing protocols (e.g., BGP, IDRP) after possibly
minor modifications in those protocols.

The remainder of this paper is structured as follows. In section 2, we give the 
general background information on the traditional inter-domain routing protocols and 
present the goals and criteria for inter-domain QoS routing in section 3. In section 4, 
we discuss problems faced when introducing inter-domain QoS routing into the 
Internet. We present and describe a model for inter-domain QoS routing and its 
building blocks in section 5. In section 6, we present five mechanisms for operating 
inter-domain QoS routing in DiffServ networks. In section 7, we briefly describe our 
works on the development of a routing simulator for investigating these mechanisms. 
Some conclusions and future works are given in the final section. 



2 Background on Inter-domain Routing 

The first inter-domain routing protocol, i.e., the Exterior Gateway Protocol (EGP),
appeared in the 80s. EGP introduced the concept of AS and supported the exchanging 
of network reachability information between ASs. As the Internet grew in the 90s, 
EGP was replaced by BGP because EGP only supported the backbone-centered tree 
topology. BGP uses the path vector approach for loop avoidance. BGP is capable of 
supporting interconnections of heterogeneous networks with arbitrary topologies. As 
a result, BGP has become the most widely used inter-domain routing protocol in the 
Internet. The latest version of BGP is BGP-4, which introduces support of the 
Classless Inter-Domain Routing (CIDR). CIDR was developed as an immediate 
solution to the problems caused by the rapid growth of the Internet, e.g., Class B
exhaustion and routing table explosion. 

Unlike interior routing protocols such as RIP and OSPF, which use a single criterion for
route selection, i.e., the shortest path, routing in BGP is policy driven. Policy routing
refers to any form of routing that is influenced by factors other than merely picking 
the shortest path. Each AS is free to choose its own set of policies, which will allow
or not allow transit data from and to other ASs. These policies possibly include several
items, such as Acceptable-Use policies (AUP), the selection of providers and even a 
particular quality of service. 

Meanwhile, several other advanced inter-domain routing protocols proposed 
recently can provide better support of policy constraints and scalability: Inter-Domain 
Policy Routing (IDPR) protocol uses the link-state approach, domain-level source 
routing and superdomains [5]; Viewserver Hierarchy Query Protocol (VHQP) 
combines domain-level view with a novel hierarchical scheme, that is, domain-level 
views are not maintained by every router but by special nodes called viewservers [6]; 
Source Demand Routing Protocol (SDRP) provides a mechanism for route selection 
to support provider selection and quality of service selection [7]. 

The policy constraints supported by the above protocols could naturally be 
incorporated into the requirements of traffic flows. Thus, it is possible to develop new 
inter-domain QoS routing protocols on the basis of the traditional inter-domain 
routing protocols. 



3 Goals and Criteria 

In general, inter-domain QoS routing aims to improve the interoperability of networks
through providing efficient routes for various services with simple methods, and to 
increase the efficient utilization of limited network resources. 

To achieve this goal, inter-domain QoS routing should cooperate with other QoS 
related works, e.g., traffic engineering mechanisms and signaling protocols, which are 
all devoted to realizing a QoS based Internet [1], [3].

The current Internet is a very large scale network containing tens of thousands of 
ASs, which are arbitrarily connected using inter-domain routing protocols. Therefore, 
introducing QoS constraints into inter-domain routing must obey the following 
criteria: 

• Compatibility

Inter-domain QoS routing protocols must be compatible with the inter-domain 
routing protocols currently used in the Internet, e.g., BGP. They should also support 
policy routing and be capable of exchanging and understanding the route information 
of BGP or other inter-domain routing protocols. 

We suggest developing a new inter-domain QoS routing protocol on the basis of
several (best effort) inter-domain routing protocols, e.g., BGP, IDPR, SDRP, etc. For 
example, it is possible to combine IDPR with SDRP for selecting a path with QoS 
constraints and forwarding data traffic along the path. Here, IDPR is expanded to 
accommodate exact QoS parameters for specific data flows. 

• Scalability

Inter-domain QoS routing must be capable of scaling up to very large networks, 
i.e., with a very large number of domains. In order to achieve scalability, several 
technologies such as hierarchy and flow aggregation are likely to be used. 
• Flexibility

The current Internet incorporates diverse heterogeneous networks with arbitrary 
size, topology and management policies. Inter-domain QoS routing protocols must be 
flexible enough to accommodate these networks. 

• Robustness

Inter-domain QoS routing protocols must automatically adapt to resource changes, 
and remain stable and robust against accidents, e.g., link or node failures.

• Feasibility

Feasibility refers to all factors that affect the development of inter-domain QoS 
routing. In general, inter-domain QoS routing introduces a tradeoff between
performance improvement and complexity of implementation and management, and
this tradeoff drives the decision of when, where and how to adopt inter-domain QoS
routing.



4 Problems Faced when Introducing Inter-domain QoS Routing 

In this section, we present and discuss the problems faced when introducing inter- 
domain QoS routing into the Internet, and their possible solutions. 

Problem 1 : What kinds of QoS metrics might be used? 

Usually, a traffic flow could be characterized by a number of QoS metrics, e.g., 
bandwidth, delay, delay variation and data loss. However, using more than one
metric may lead to significant computational complexity, e.g., finding a route with two
constraints is usually an NP-hard problem [3]. Therefore, a single metric (i.e.,
bandwidth) is preferred for simplicity. Note that QOSPF adopts
bandwidth as the only QoS metric [4].
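
To illustrate why a single bandwidth metric keeps route computation simple, the following Python sketch first prunes the links whose available bandwidth is below the request and then searches a minimum-hop path on the remaining graph. This is one common way to handle a bandwidth constraint and is given here as an assumption; it is not claimed to be the algorithm of any particular protocol. The function name and the AS-level link table are hypothetical.

```python
from collections import deque

def feasible_path(links, src, dst, demand):
    """Minimum-hop path from src to dst over links with enough bandwidth.

    links: {(u, v): available_bandwidth} for directed AS-level links.
    Returns the path as a list of nodes, or None if no feasible path exists.
    """
    neighbours = {}
    for (u, v), bandwidth in links.items():
        if bandwidth >= demand:          # prune links that cannot carry the flow
            neighbours.setdefault(u, []).append(v)
    parent = {src: None}
    queue = deque([src])
    while queue:                         # breadth-first search gives minimum hops
        u = queue.popleft()
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in neighbours.get(u, []):
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return None
```

The whole computation stays linear in the number of links, whereas adding a second constraint such as delay would in general require a much more expensive search.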

On the other hand, requirements for QoS do not contradict other policy controls on 
route selection. For example, the best route might be selected from several QoS 
guaranteed or assured routes according to administrative policies. 

Problem 2: How is the resource information distributed and updated? 

Resource information of an AS usually includes such items as the available
bandwidth and the remaining buffer memory. Each AS monitors variations of
available resources, and sends/receives resource information to/from other ASs.
Several current inter-domain routing protocols might be used for this task. For 
example, based on the link state approach as in IDPR, resource information together 
with domain policies can be distributed and updated across domains. 

On the other hand, it is neither strictly required nor necessarily desirable for
inter-domain QoS routing to distribute and update resource information. This is because,
first, although suitable inter-domain routing protocols (e.g., IDPR) have been
available for several years, they are not yet widely deployed in practical networks.
Second, with variations of available resources, distributing resource information
across multiple domains might significantly increase the transmission overhead,
especially in a very large network. Third, as the number of ASs increases, it becomes more
difficult to keep a consistent link state database and make consistent decisions on
route generation and selection among ASs.

Therefore, we present three of the five mechanisms in section 6, i.e., mechanisms 1, 2
and 3, which can avoid distributing and updating resource information. To some
extent, in these algorithms the function of distributing and updating resource 
information is carried out by signaling protocols. 

Problem 3 : What algorithms might be used for computing routes? 

Routing algorithms for finding the shortest path might also be used in inter-domain
QoS routing, e.g., Bellman-Ford and Dijkstra or their modified versions. Moreover, a
number of recently presented QoS routing algorithms could also be used if their
computational complexity is acceptable [3].

Problem 4: What sorts of policy controls may be exerted on path computation and 
selection? 

Each AS sets its own routing policies such as access authentication, QoS and so on.
These policies are mostly used in inter-domain QoS routing, too. 

Problem 5: How is the external routing information represented within an AS? 

This problem is left for further study. We also need to understand how external information can be
aggregated and how the frequency of resource availability changes can best be
controlled so that the signaling overhead and the possibility of routing loops are
minimized.

Problem 6: How is the resource information stored? 

Resource information might be locally stored in each exterior gateway or globally 
stored in a routing agent of a number of exterior gateways. Either each gateway's or an
agent's topology database could possibly be enhanced to accommodate resource
information of ASs. In order to achieve scalability, resource information in low-level
ASs needs to be aggregated to high-level ASs.

Problem 7: What routing capabilities are needed (e.g., source routing, on-demand 
path computation)? 

When implementing an inter-domain QoS routing protocol, there are a number of 
options for computing routes, e.g., source routing vs. distributed routing, on-demand 
computation vs. pre-computation, etc. Source routing determines routes by the source 
AS while distributed routing computes routes by many ASs. For source routing the 
source AS should have the knowledge of topology and other global information. 
Distributed routing requires ASs to adopt the same routing algorithm. Both source 
routing and distributed routing need to keep the consistency of topology databases of 
different nodes. Otherwise, any discrepancy is likely to result in incorrect route 
computation and even in loops. 

Routes might be computed on demand or in advance, i.e., pre-computation.
On-demand computation can obtain better efficiency under light request load and worse
efficiency under heavy request load than pre-computation. In practice, QoS route
requests must be limited to a very low level in order to keep the network stable,
especially considering the whole Internet.

Problem 8: How is resource availability measured? 

If applications are adaptive, they tend to use whatever resources are available and 
as a consequence, congestion is the normal state of the network and normally no 
resources are available. We may try to measure how many users are creating the load. 
If it is only a few, under the assumption of adaptive applications, we can deduce that 
still a lot of resources are available. If the congestion is created by many users, we 
must assume that the congestion is real. 



5 An Inter-domain QoS Routing Model 



In this section, we present a model for inter-domain QoS routing and describe its 
building blocks. 




Fig. 1. An inter-domain QoS routing model 

As shown in Figure 1, this model is composed of three functional blocks (i.e.,
Policy Control, Route Computation & Selection, and Routing Information
Advertisement and Update) and three tables (i.e., topology database, flow table, and
aggregated routing table). Policy Control applies specified policies to finding routes
and exchanging routing information. It might include source policies and transit
policies, which are specified by the AS administrator. Moreover, these policies might
be described using the Routing Policy Specification Language (RPSL) to achieve
stable and analyzable Internet routing [8].

Route Computation & Selection determines routes based on the knowledge of
topology information and policy constraints. Routes are computed and saved into
the flow table for data forwarding. The flow table is used to store information related
to specific flows, in terms of traffic parameters, requirements for QoS, etc.

Routing Information Advertisement and Update is in charge of broadcasting
routing information (e.g., resource information, policy constraints, routes selected,
etc.) and updating the local database when receiving routing information from other
ASs. It is also responsible for broadcasting external routes to the interior routers and
for aggregating routes.



6 Mechanisms for Operating Inter-Domain QoS Routing in
DiffServ Networks

As discussed in previous sections, introducing inter-domain QoS routing into the
Internet might meet a number of problems. In this section, we present five
mechanisms which will facilitate the development and deployment of inter-domain
QoS routing. This work should be considered in connection with other work devoted
to a QoS based Internet, that is, Differentiated Services, traffic engineering, etc.

Differentiated Services is an architecture for building up a QoS based Internet. It is 
designed to offer QoS assurances without losing scalability. Meanwhile, Multi 
Protocol Label Switching (MPLS), which is regarded as one of the core switching 
technologies for the future Internet backbone, provides mechanisms for traffic 
engineering. The future IP networks are possibly DiffServ MPLS networks. On the 
other hand, with the enlargement of MPLS networks, it becomes necessary to 
consider routing among multiple domains. We present the DiffServ architecture with 
inter-domain QoS routing in the following subsection. 



6.1 DiffServ Architecture with Inter-domain QoS Routing 



[Figure 2 blocks: Customer services; Network operation, management and provision;
QoS signaling; Intra-domain & Inter-domain QoS routing]

Fig. 2. DiffServ architecture with inter-domain QoS routing

Figure 2 shows the DiffServ architecture. Requirements from customer services are
first clarified using a number of QoS metrics. Then the network provider provisions
the network resources for supporting these services. To maintain reliability and
usability of the network, the network provider must perform network operation,
management and provision. A QoS signaling protocol^ is needed to broadcast control
messages from the network manager or to exchange interoperation information between
network nodes. Intra-domain & inter-domain QoS routing selects routes for data
transfer within and outside the DS domain. Data packets are classified and forwarded
to the destination by the network nodes.

^ Currently, there are two candidates for signaling protocols in the Internet, i.e., Constraint
based Routing Label Distribution Protocol (CR-LDP) [9] and Extended Resource
Reservation (ERSVP) [10].

In order to achieve consistent support of QoS, service level agreements (SLAs)
should be established in advance. SLAs describe the agreements on flow specifications
and QoS support between adjacent domains.



6.2 Design Assumptions and Main Functions 

In this subsection, we first present some general assumptions: 

- A network node is a router;

- A DS domain is an AS;

- Intra-domain & inter-domain QoS routing computes routes for specific flows on
demand;

- Intra-domain & inter-domain QoS routing protocols provide routes for best effort
services in the same way as intra-domain & inter-domain routing protocols.

Figure 3 illustrates the main functions and the procedures for setting up paths
across domains. A signaling entity (SE) is the signaling agent of a DS domain, while
a routing entity (RE) is the routing agent of a DS domain running inter-domain QoS
routing protocols. The SE's functions include outgoing and incoming parts. The outgoing
part collects QoS requests from interior routers and decides whether to initiate path
setup requests; the incoming part processes path setup requests from other SEs. The SE
queries its local RE for external routes, and the RE replies with next hops or whole
routes. Note that the path setup request message usually contains the specifications of
the flow and the requirements for QoS.




Fig. 3. Setting up paths across domains 



6.3 Mechanisms for Operating Inter-Domain QoS Routing in DiffServ 
Networks 

First, we should note that the following mechanisms mostly omit the part of route
computation (e.g., how to compute next hops or whole routes and how to distribute
resource information, etc.). Instead, these mechanisms mainly focus on the functions
of SEs related to REs. This is because at present no inter-domain QoS
routing protocol is available and the implementation of such a protocol is still unclear.
Next, we present the five mechanisms using pseudo-code.

Mechanism 1 : SE based - crankback 

The pseudo-code of this mechanism is given in Figure 4. For simplicity, we just 
describe the procedures and interactions between SE and RE in transit domains. 



Mechanism 1: SE based - crankback

If SE receives a path setup request message from upstream SE
    SE queries its local RE for next hop;
    if RE replies a blank next hop
        SE sends a path setup failure message to upstream SE;
    else
        SE checks its local resource information database;
        if there is enough available resource to that hop
            SE adds itself to the route list of the path and
            sends the path setup request message downstream;
        else
            if SE has queried RE for K times for setting up this path
                SE sends a path setup failure message to upstream SE;
            else
                SE queries its local RE for next hop again;
            endif
        endif
    endif
else
    if SE receives a path setup failure message from downstream SE
        if SE has queried RE for K times for setting up this path
            SE sends a path setup failure message to upstream SE;
        else
            SE queries its local RE for next hop again;
        endif
    endif
endif



Fig. 4. Mechanism 1 : SE based - crankback 

When an SE receives a path setup request message from an upstream SE, it first
asks its local RE for a next hop. If the RE replies with a non-blank next hop, the SE
checks whether there is enough available resource on the link to that hop. If yes, the
SE adds itself to the route list of the path and sends the request to that hop. If no, it
asks the local RE for a next hop again. If the SE has queried the RE K times, it sends a
path setup failure message upstream. Here, K is a given constant. If the SE receives a
path setup failure message from a downstream SE, it also asks its local RE for a next
hop again. A feasible route has been found once the request reaches the destination.
In this case, resource reservation proceeds downstream.
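
As an illustration only, the transit-SE logic of Figure 4 can be sketched in Python as follows. The function name, the iterator `re_answers` standing in for successive RE replies, and the callbacks `has_resource` and `forward` are hypothetical names introduced for this sketch; the paper does not define such an interface.

```python
from typing import Callable, Iterator, List, Optional


def transit_se_crankback(
    se_id: str,
    route_list: List[str],
    re_answers: Iterator[Optional[str]],
    has_resource: Callable[[str], bool],
    forward: Callable[[str, List[str]], bool],
    k: int,
) -> Optional[str]:
    """Handle one path setup request at a transit SE (sketch of Figure 4).

    re_answers yields the next hops proposed by the local RE (hypothetical
    interface); has_resource(hop) consults the local resource database;
    forward(hop, route) returns False when a failure message comes back.
    Returns the accepted next hop, or None if failure is reported upstream.
    """
    queries = 0
    while queries < k:
        hop = next(re_answers, None)  # SE queries its local RE for a next hop
        queries += 1
        if hop is None:
            return None  # blank next hop: failure goes upstream
        if not has_resource(hop):
            continue  # not enough resource to that hop: ask the RE again
        if forward(hop, route_list + [se_id]):
            return hop  # request accepted downstream
        # a failure message came back: crank back and ask the RE again
    return None  # the RE has been queried K times: failure goes upstream
```

In this sketch the constant K bounds the total number of RE queries per path, whether a retry is caused by a local resource shortage or by a failure message from downstream.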

This mechanism does not require the RE to understand global resource
information, that is, there is no need for a global topology and resource information
database. As a result, advertising and updating resource information can be avoided.
The current inter-domain routing protocol (i.e., BGP) can be introduced directly into
DiffServ networks, apart from minor modifications to its interface with the SE.

On the other hand, this mechanism has a few obvious disadvantages. For example, 
it might take a long time to find feasible routes because of crankback. This method 
also increases the overhead of SE, i.e., processing and transmission of signaling 
messages. 

Mechanism 2: SE based flooding 

This mechanism is a modified version of the first mechanism. It is designed to
shorten the time for searching feasible routes by using flooding. That is, the RE replies
to the SE with all possible next hops, and the SE then floods requests to all possible
next hops after checking its local resource information database. It should be noted
that SEs do not need to reply to the previous SE whether they find next hops or not. If
some SEs fail to find feasible next hops, they simply discard the requests. A feasible
route has been found once a request reaches the destination. However, if the
destination SE receives multiple requests with various routes from the same source SE,
it is responsible for selecting only one route and discarding the others. Then, it sends
a reply message and proceeds with resource reservation upstream.

Also, this mechanism does not require each node to maintain a global topology and 
resource information database. 

The pseudo-code for SE in transit domains is given in Figure 5. 



Mechanism 2: SE based - flooding

If SE receives a path setup request message from upstream SE
    SE queries its local RE for all possible next hops;
    if RE replies a blank next hop
        return;
    else
        SE checks its local resource information database;
        if there is enough available resource to any of these hops
            SE adds itself to the route lists of the path and
            floods the path setup request messages to each of these possible hops;
        endif
    endif
endif



Fig. 5. Mechanism 2: SE based flooding 

This algorithm is expected to improve routing efficiency at the cost of
increased signaling overhead.
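
The flooding behaviour can be illustrated with a small simulation, given here as a sketch under stated assumptions. The adjacency map `links` (available bandwidth per link) stands in for the RE replies and the local resource databases, requests that would revisit an SE already on their route list are dropped, and the destination picks the shortest collected route. These three choices are assumptions of this sketch and are not prescribed by the mechanism.

```python
from collections import deque
from typing import Dict, List, Optional


def flood_path_setup(
    links: Dict[str, Dict[str, int]],
    source: str,
    dest: str,
    demand: int,
) -> Optional[List[str]]:
    """Flood a path setup request and return the route the destination selects.

    links[node][neighbor] is the available bandwidth on that link
    (hypothetical data model).
    """
    arrived: List[List[str]] = []
    pending = deque([(source, [])])
    while pending:
        node, route = pending.popleft()
        route = route + [node]  # the SE adds itself to the route list
        if node == dest:
            arrived.append(route)  # the destination collects candidate routes
            continue
        for hop, available in links.get(node, {}).items():
            # flood only to hops with enough resource; skip hops already
            # on the route list (assumed loop avoidance)
            if available >= demand and hop not in route:
                pending.append((hop, route))
        # if no hop qualified, the request is silently discarded
    # the destination keeps one route and discards the others
    return min(arrived, key=len) if arrived else None
```

For example, with links S-A (10), S-B (2), A-D (10), A-B (10) and B-D (10) and a demand of 5, the request never leaves S towards B, and the destination D chooses between the routes S-A-D and S-A-B-D.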

Mechanism 3 : Cache based Routing 

This mechanism is an enhancement to the above two mechanisms. Figure 6 shows
its pseudo-code.



Mechanism 3: Cache based routing

If SE needs to setup a path
    SE looks up routing cache first;
    if there are available routes in the cache
        SE checks their feasibility and updates the cache;
        if there is any feasible route
            use the route;
        else
            SE queries RE for routes and updates the cache;
        endif
    else
        SE queries RE for routes and updates the cache;
    endif
endif



Fig. 6. Mechanism 3 : Cache based 

The SE caches success and failure information on next hops. Therefore, subsequent
requests adopt previously successful routes and avoid previously unsuccessful routes.
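
A minimal sketch of the cache logic of Figure 6 follows. The class name `RouteCache` and the callbacks `query_re` and `is_feasible` are hypothetical. Returning the first feasible route after a fresh RE query is an assumption of this sketch, since the pseudo-code only states that the cache is updated.

```python
from typing import Callable, Dict, List, Optional

Route = List[str]


class RouteCache:
    """Per-SE routing cache keyed by destination (illustrative sketch)."""

    def __init__(
        self,
        query_re: Callable[[str], List[Route]],
        is_feasible: Callable[[Route], bool],
    ) -> None:
        self._query_re = query_re        # asks the local RE for routes
        self._is_feasible = is_feasible  # checks current resource availability
        self._routes: Dict[str, List[Route]] = {}

    def setup_path(self, dest: str) -> Optional[Route]:
        cached = self._routes.get(dest, [])
        if cached:
            # check feasibility and update the cache: infeasible routes are dropped
            feasible = [r for r in cached if self._is_feasible(r)]
            self._routes[dest] = feasible
            if feasible:
                return feasible[0]  # reuse a previously stored route
        # cache miss, or nothing feasible left: query the RE and refresh the cache
        fresh = list(self._query_re(dest))
        self._routes[dest] = fresh
        feasible = [r for r in fresh if self._is_feasible(r)]
        return feasible[0] if feasible else None
```

The RE is contacted only when the cache holds no feasible route for the destination, which is where the expected saving in routing queries comes from.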

Mechanism 4: D-hop resource routing

As shown in Figure 7, REs advertise resource availability information with TTL
(Time To Live) set to D, the depth of resource availability dissemination, where D is a
small integer indicating the maximum number of hops over which resource availability
information is distributed. Each node calculates only the next hop, taking into account
not only the local resource availability information but also information up to the
depth of D. Path vector information is used for loop avoidance. The parameter L in
this mechanism limits the number of attempts at searching for feasible next hops.



Mechanism 4: D-hop resource routing

If SE receives a path setup request message from upstream SE
    do while next hop in Path-so-far and tries < L
        SE queries local RE for path using resource info until D;
        increment tries
    enddo
endif



Fig. 7. Mechanism 4: D-hop resource routing 

This mechanism could also be combined with Cache based QoS routing in 
mechanism 3. 
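
The two ingredients of this mechanism, TTL-limited dissemination and bounded next-hop selection with loop avoidance, can be sketched as follows. Both function names, the adjacency map `neighbors` and the candidate iterator are hypothetical; the sketch only shows how far an advertisement with TTL set to D travels and how the parameter L bounds the search, not how resource information enters the route computation.

```python
from collections import deque
from typing import Dict, Iterable, List, Optional


def advertise(neighbors: Dict[str, List[str]], origin: str, d: int) -> Dict[str, int]:
    """Return {node: hop distance} for every node reached by an
    advertisement that starts at origin with TTL = d."""
    reached = {origin: 0}
    frontier = deque([(origin, d)])
    while frontier:
        node, ttl = frontier.popleft()
        if ttl == 0:
            continue  # TTL exhausted: the advertisement is not forwarded
        for nxt in neighbors.get(node, []):
            if nxt not in reached:  # duplicates are ignored
                reached[nxt] = reached[node] + 1
                frontier.append((nxt, ttl - 1))
    return reached


def pick_next_hop(
    candidates: Iterable[str], path_so_far: List[str], max_tries: int
) -> Optional[str]:
    """Try at most max_tries (the parameter L) candidate next hops and
    return the first one that is not already on the path vector."""
    tries = 0
    for hop in candidates:
        tries += 1
        if hop not in path_so_far:
            return hop  # loop-free next hop found
        if tries >= max_tries:
            break
    return None
```

On a chain A-B-C-D-E, an advertisement from A with D = 2 reaches only B and C, so nodes further away decide without A's resource information.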

There are issues with this algorithm, including

- the frequency of resource availability information updates and consequently
the frequency of recalculations of the routing tables;

- how to measure resource availability.

Regarding the last bullet, we suggest measuring only the use of resources by the one
or two highest traffic classes. If applications have fixed resource requirements, this
algorithm should give a good spread of high quality traffic across the network and
should help to avoid creating hot spots of traffic in the network. If the applications
that require a high traffic precedence class are elastic, measuring resource usage
becomes more complicated and a more intelligent measurement and filtering
mechanism is required.

Mechanism 5 : RE based source routing 

In this mechanism, the source SE asks its local RE for whole routes to the destination.



Mechanism 5: RE based source routing

If source SE decides to setup a path
    SE queries its local RE for the whole route;
endif



Fig. 8. Mechanism 5: RE based source routing 

This mechanism is suitable when IDPR is used. However, the current IDPR
would have to be extended to support broadcasting resource information and updating
the global topology database.



6.4 Considerations on Deployment 

As described above, the first three mechanisms place fewer requirements on routing
protocols. These mechanisms do not depend on the detailed implementation of
route computation, so they can transparently support the most widely used
inter-domain routing protocol, i.e., BGP. In these mechanisms, the QoS routing
decisions are made by SEs instead of REs. Since SEs naturally provide functions
related to the support of QoS, the first three mechanisms can greatly facilitate the
deployment of routing across multiple domains and thereby improve the support of QoS.

In contrast, the last two mechanisms, especially mechanism 5, are likely to rely on
the development of inter-domain QoS routing itself. IDPR is a candidate for
inter-domain QoS routing. The inter-domain QoS routing protocol is responsible for
advertising resource information and determining routes mostly according to the
network resource information and QoS requirements. These two mechanisms may
provide better support of QoS by directly finding QoS based routes.

In any case, the efficiency of the five mechanisms needs careful verification.



7 Simulator 



In order to study constraint based routing as well as the mechanisms presented in the
paper, we are currently developing a new routing simulator, the Extended
QoS based Routing Simulator (EQRS) [11]. It is designed on the basis of the DiffServ
architecture, which consists of essential QoS related components, e.g., network
management, signaling protocol, routing protocol, etc. The mechanisms presented in this
paper are expected to be implemented in EQRS. EQRS allows users to configure
parameters of DiffServ MPLS networks, where the dynamics of constraint based
routing algorithms as well as traffic engineering mechanisms can be investigated.
With the help of EQRS, our future work can focus on the investigation and verification
of these mechanisms. Also, this simulator is suitable for modeling, designing and
evaluating DiffServ MPLS networks.



8 Conclusions 

With the rapid growth of the Internet, inter-domain QoS routing is becoming an
important topic for developing large QoS based IP networks. In this paper, we
investigate problems faced when introducing inter-domain QoS routing into the
Internet. We also present an inter-domain QoS routing model and five mechanisms
for operating inter-domain QoS routing in DiffServ networks. These mechanisms are
suitable for introducing the current inter-domain routing protocols into DiffServ
networks.

On the other hand, there are still a large number of open research problems
concerning QoS routing across domains, e.g., methods of flow aggregation,
algorithms for advertising and updating resource information, etc. Our near-future
work will focus on the design and implementation of the mechanisms presented in this
paper. Our longer-term work will be devoted to studying those open problems.



Abstract. The priority queueing mechanism is analysed to verify its effectiveness
when applied for the support of Expedited Forwarding-based services in the
Differentiated Services environment. An experimental measurement-based
methodology is adopted to outline its properties and end-to-end performance when
supported in real transmission devices. A test layout has been set up over a
metropolitan area for the estimation of one-way delay and instantaneous packet
delay variation.

The effect of relevant factors like the buffering architecture, the background
traffic packet size distribution and the EF traffic profile is considered. In
particular, the complementary one-way delay probability function is computed for a
given packet size distribution and the Aggregation Degree parameter is defined to
quantify the effect of traffic aggregation on end-to-end QoS.^

Keywords: Priority Queueing, Differentiated Services, Expedited Forwarding,
Performance measurement, One-way delay, Instantaneous packet delay variation



1 Introduction 

The Differentiated Services framework has recently been considered by the
scientific community as a scalable solution for the support of Quality of Service
to time-sensitive applications. In the Differentiated Services architecture (diffserv)
[1,2] traffic is classified, metered and marked at the edge of the network
so that streams with similar requirements are placed in the same class, i.e. are
marked with the same label (the Differentiated Services Code Point, DSCP [3]),
and treated in an aggregated fashion so that no per-flow differentiation is required.

^ This work has been partially funded by M.U.R.S.T, the Italian Ministry of University
and scientific and technological research, in the framework of the research project
Quality of service in multi-service telecommunication networks (MQOS). This work
is a joint activity carried out in collaboration with the TF-TANT task force in the
framework of the EC-funded project QUANTUM [17,18,19].

QoS guarantees are applied to the class as a whole instead of to single
streams. Class differentiation is achieved through queueing, which is placed at
critical points of the network so that datagrams with identical DSCP are stored
in the same queue. Queues are drained according to the service order defined
by the scheduling algorithm adopted by the queueing system [4]. Several packet
treatments, the so-called Per-Hop Behaviours (PHBs), have been standardized
so far: the Expedited Forwarding PHB (EF) (RFC 2598) for the support of
delay- and jitter-sensitive traffic and the Assured Forwarding PHB Group (AF)
(RFC 2597) for the differentiation into relative levels of priority. The traditional
Best-Effort (BE) packet treatment is an additional valid PHB.

The experimental approach to the problem of delay and jitter offers the 
challenge of verifying the influence of scheduling [5,6], one of the main QoS 
building blocks, on end-to-end traffic performance when applied in a production- 
like environment. 

In this paper we study the performance of a specific queueing algorithm: Pri- 
ority Queueing (PQ), when applied to delay- and jitter-sensitive traffic. Several 
test scenarios are considered: end-to-end performance is analysed as a function 
of the background traffic pattern and of the priority traffic characteristics like 
the frame size, the number of concurrent flows and their profile. 

The goal is to identify the requirements for an effective deployment of PQ when
applied to the Expedited Forwarding PHB: we derive a general law describing
the queueing delay introduced by PQ under different traffic profiles and we
evaluate the PQ nodal jitter as a function of the traffic profile when several EF
streams run concurrently.

Section 2 introduces the network testbed deployed for end-to-end performance
measurement, while in Sections 3 and 4 we provide a high-level description
of the queueing system under analysis and of the measurement methodology and
metrics adopted in this paper. Section 5 focuses on the nodal delay introduced
by PQ in the presence of different best-effort traffic scenarios, while in Section 6 we
develop the details of the effect of EF aggregation on both one-way delay and
instantaneous packet delay variation. The EF packet size, an additional important
factor, is evaluated in Section 7 and the article is concluded by Section 8,
in which we summarize the main achievements discussed here.



2 Network Layout 

A metropolitan network based on 2 Mbps ATM connections was deployed as
illustrated in Figure 1. Packet classification, marking and policing are enabled on
router C7200^ (experimental IOS version 12.0(6.5)T7) and PQ is the scheduling
algorithm configured on its output interface. The first router C7200 on the data
path is the only one that enables QoS features, while the two remaining routers
C7500 are non-congested FIFO devices introducing a constant transmission delay
which is only a function of the EF packet size. The two routers C7500 have
no influence on jitter, since the minimum departure rate at the output interface
is always greater than or equal to the maximum arrival rate; as a consequence the
input stream profile is not distorted when handled by the C7500s.

^ No shaping of IP traffic was applied in the test scenarios analysed in this paper.
However, ATM shaping was supported in router C7200. The study of the impact of
shaping on delay and jitter is subject of future research.

The round trip time (RTT) of a 64-byte packet is approximately 2 msec, but
RTT increases linearly with the packet size.

The SmartBits 200 by Netcom Systems is deployed as measurement point and 
traffic generator, while EF and BE background traffic is introduced by test work- 
stations both to congest the egress ATM connection of router C7200 and to create 
aggregation when needed. Both EF and BE traffic are received by the C7200 from 
the same interface and they share the same data path to the destination. The 
SmartBits 200 is a specialized platform which performs packet time-stamping in 
hardware with a precision of 100 nsec and it is capable of gathering measures for 
a large range of performance metrics. 

Sources and destinations are all located in Site 1 and connected through a 
switched Ethernet. Both EF and BE traffic are looped back to Site 1 so that 
the measurement point can be deployed as source and receiver at the same time. 
In this way accuracy in one-way delay measurement is not affected by clock 
synchronization errors. 




Fig. 1. Diffserv test network 



3 Diffserv Node Architecture 

In this section we present the queueing architecture of the ingress diffserv node
whose performance is the subject of our analysis. The system can be represented
as the combination of a queueing system coupled with a FIFO transmission queue
(Figure 2).

Queueing system. It represents the diffserv nodal component on which traffic
differentiation relies. In this study we assume that it is based on two scheduling
algorithms: Priority Queueing and Weighted Fair Queueing [13,14,15]. We restrict
the analysis to a single-level Priority Queueing model, in which only one priority
queue is present, while the Weighted Fair Queueing system can be composed of
one or more different queues. Packets are distributed among queues according
to the label carried in the IP header, which identifies the Behaviour Aggregate
(BA) the packet belongs to.

For simplicity we restrict our analysis to a scenario based on only two per-hop
behaviours: the Expedited Forwarding PHB (EF) and the Best-Effort PHB
(BE). PQ is used to handle delay- and jitter-sensitive traffic, i.e. EF traffic,
while background BE traffic is stored in a WFQ queue. Nevertheless, results
are generally applicable to a WFQ system with multiple queues, since PQ is
independent of the number of WFQ queues by definition of the algorithm.
Despite the above-mentioned restrictions, the WFQ system can be composed
of an arbitrarily large number of queues. The combined presence of a PQ queueing
module and of a WFQ queueing module in an output interface makes it
possible to support at the same time services for delay- and jitter-sensitive
traffic as well as services for loss-, delay- and jitter-sensitive traffic.

Queues handle aggregated traffic streams rather than individual flows; in other words,
each queue in the queueing system illustrated in Figure 2 collects packets from
two or more micro-flows. Aggregation is a key feature in diffserv networks,
introduced to increase the scalability of the architecture.

The length of the priority queue is limited in comparison with the BE queue: the 
priority queue length is limited to 10 packets, while the best-effort queue length 
is set to the maximum allowed by the system, i.e. 64 packets. The availability of 
queue space is not relevant in a high-priority queue since the instantaneous ac- 
cumulation of data is avoided through policing and shaping in order to minimize 
the corresponding queueing delay. 
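
The first queueing stage described above can be illustrated by the following sketch of a strict-priority scheduler with one EF queue and one BE queue. The queue limits of 10 and 64 packets are those stated in the text; the class name, the tail-drop policy on overflow and the reduction of WFQ to plain FIFO service (valid only because a single BE queue is modelled) are assumptions of this sketch.

```python
from collections import deque
from typing import Any, Optional


class PriorityScheduler:
    """One priority (EF) queue served strictly before one BE queue."""

    def __init__(self, ef_limit: int = 10, be_limit: int = 64) -> None:
        self.ef: deque = deque()
        self.be: deque = deque()
        self.ef_limit = ef_limit
        self.be_limit = be_limit

    def enqueue(self, packet: Any, is_ef: bool) -> bool:
        """Store the packet; return False if it is dropped (assumed tail drop)."""
        queue, limit = (self.ef, self.ef_limit) if is_ef else (self.be, self.be_limit)
        if len(queue) >= limit:
            return False
        queue.append(packet)
        return True

    def dequeue(self) -> Optional[Any]:
        """Select the next packet for the transmission queue."""
        if self.ef:
            return self.ef.popleft()  # EF is always served first
        if self.be:
            return self.be.popleft()  # BE is served only when EF is empty
        return None
```

Note that strict priority applies only at selection time: a BE packet already handed to the transmission queue is not preempted, which is the source of the EF delay studied in the following sections.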

Transmission queue. It is an additional buffering stage introduced to capture
the common architecture of many transmission devices in which the line adapter
is separated from the queueing block by an internal transmission element, e.g.
a bus. The transmission queue gathers packets issued by the queueing system
according to the order defined by the scheduler and it is emptied at line rate. In
this study we assume that the transmission queue is serviced in a FCFS fashion
as this is the service policy commonly adopted for production devices.

Given the relatively small rate of the wide area ATM connection (2 Mbps), the
time needed to dequeue a packet and to place it in the transmission queue is
assumed to be infinitely small in comparison with its transmission time. In this
paper the set of WFQ and PQ queues was configured on an ATM interface and
the transmission queue is emptied at line rate, i.e. at the PVC rate, which is
2 Mbps.

Memory in the transmission queue is allocated so that as soon as one unit is
freed, the unit can be immediately allocated to store a new packet, the one
selected by the first queueing stage for transmission. If one memory unit is
not enough to store the whole packet, additional units can be temporarily
added to the transmission queue so that the packet can be entirely stored. This
means that if the transmission queue is 5 units long and the MTU size is 1500 bytes,
then the maximum instantaneous queue length is 7 units when the queue stores
two best-effort packets and 1 EF packet.
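
The figure of 7 units can be checked with a short calculation, sketched below. The 512-byte unit size is the value given in Section 5.1; the function names and the admission rule as coded (a packet is admitted whenever at least one unit is free, borrowing extra units if it needs more) are our reading of the description above and should be taken as assumptions.

```python
from typing import Iterable


def units_needed(packet_bytes: int, unit_bytes: int = 512) -> int:
    """Number of fixed-size memory units occupied by one packet."""
    return (packet_bytes + unit_bytes - 1) // unit_bytes  # integer ceiling


def peak_queue_units(
    nominal_units: int, packets: Iterable[int], unit_bytes: int = 512
) -> int:
    """Units in use after admitting packets in order, with no departures.

    A packet is admitted while at least one nominal unit is free; if it
    needs more units than are free, the extra units are added temporarily.
    """
    used = 0
    for size in packets:
        if used >= nominal_units:
            break  # no free unit: the next packet waits in the first stage
        used += units_needed(size, unit_bytes)
    return used
```

With a 5-unit queue, a 1500-byte BE packet (3 units) followed by a 128-byte EF packet (1 unit) leaves one unit free, so a second 1500-byte BE packet is admitted and the occupancy reaches 3 + 1 + 3 = 7 units.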




Fig. 2. Architecture of the diffserv node queueing system 



The original contribution of this paper comes from the fact that two queueing
algorithms (PQ and WFQ) are coupled with a finite-length and discrete
transmission queue as in real-life network systems and from the fact that different
traffic models are used at the same time to feed different queues.

Analytic studies of both PQ and WFQ can be found in the literature [7,8].

4 Measurement Methodology 

The priority queueing algorithm is evaluated by focusing on two metrics: one- 
way delay and instantaneous packet delay variation (IPDV). The two above- 
mentioned parameters were selected to verify the suitability of PQ as scheduling 
algorithm when applied to the EF PHB. 



One-way Delay is defined in RFC 2679. This metric is measured from the wire 
time of the packet arriving on the link observed by the sender to the wire time 
of the last bit of the packet observed by the receiver. The difference of these 
two values is the one-way delay. In our experimental scenario one-way delay is 
derived from the cut-through latency measured by the SmartBits 200 according 
to RFC 1242. 

Instantaneous Packet Delay Variation is formally defined by the IPPM working
group Draft [9]. It is based on one-way delay measurements and it is defined
for (consecutive) pairs of packets. A singleton IPDV measurement requires two
packets. If we let $D_i$ be the one-way delay of the $i$-th packet, then the IPDV of
the packet pair is defined as $D_i - D_{i-1}$.

According to common usage, jitter is computed according to the following formula:
$\mathit{jitter} = |D_i - D_{i-1}|$.

In our tests we assume that the drift of the sender clock and receiver clock is
negligible given the time scales of the tests discussed in this article. In the
following we will refer to jitter simply as IPDV.

It is important to note that while one-way-delay requires clocks to be synchro- 
nized or at least the offset and drift to be known so that the times can be 
corrected, the computation of IPDV cancels the offset since it is the difference 
of two time intervals. If the clocks do not drift significantly in the time between 
the two time interval measurements, no correction is needed. 
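
The definitions above translate directly into code. The sketch below computes IPDV and jitter from a sequence of one-way delay samples; the function names and the sample values (in microseconds) are illustrative and do not come from the measurements.

```python
from typing import List, Sequence


def ipdv(delays: Sequence[int]) -> List[int]:
    """IPDV of each consecutive packet pair: D_i - D_(i-1)."""
    return [later - earlier for earlier, later in zip(delays, delays[1:])]


def jitter(delays: Sequence[int]) -> List[int]:
    """Jitter of each consecutive packet pair: |D_i - D_(i-1)|."""
    return [abs(v) for v in ipdv(delays)]


# Illustrative one-way delays in microseconds (made-up values).
samples = [1000, 1200, 1100, 1100]
offset = 37  # constant clock offset between sender and receiver

# A constant offset shifts every delay equally and cancels in the difference.
shifted = [d + offset for d in samples]
same = ipdv(shifted) == ipdv(samples)
```

Here `same` evaluates to true, which illustrates why IPDV needs no clock synchronization as long as the clocks do not drift between the two delay measurements.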

One-way delay and IPDV are analysed by computing their frequency distri- 
bution over a population of 10000 samples. 

When several EF concurrent streams are run in parallel, performance measure- 
ment is only applied to one stream, which in this paper is referred to with the 
term reference stream. Such a stream is generated by the SmartBits through 
the application SmartWindows 6.53, while additional EF flows are generated 
through the application mgen 3.1 [10]. 

The load of the EF class is kept limited to a small fraction of the line rate (32%). 
Both the priority queue size and the transmission queue size are constant: the 
former is equal to 10 packets, the latter to 5 memory units. Both EF and BE 
streams are constant bit rate, unidirectional UDP flows. 

While in principle EF traffic can be both UDP and TCP, we restrict our analysis 
to UDP streams because we want to study a queueing system which does not 
include traffic conditioning modules (like shapers and/or policers), which are 
needed in case of bursty traffic. In this paper we assume that input traffic is 
correctly shaped and does not exceed the maximum EF reservation. 

In an ideal system end-to-end performance of a flow belonging to a given 
class should be independent of the traffic profile of both its behaviour aggregate 
and of other behaviour aggregates present in the queueing system. 

However, test results show that one-way delay experienced by packets subject 
to priority queueing is influenced by three main factors: 

1. The packet size frequency distribution of background traffic, 

2. The instantaneous length of the priority queue, 

3. The EF packet size. 

In the first case the packet size has an influence on the queueing delay introduced
by both the transmission queue and the priority queue itself. In fact, an EF
packet has to wait for the completion of the current packet transmission before it
is selected next by the scheduler for transmission. However, the profile of the
reference behaviour aggregate can also impact the end-to-end performance: in the
presence of burstiness (for example stemming from the packet clustering introduced
by aggregation) the priority queue length can be instantaneously non-zero. As
a consequence the nodal queueing delay introduced by the priority queue can
be different depending on the position of an EF packet within a burst. In the
following sections the impact of the above-mentioned factors on one-way delay
is analysed in detail.
5 Packet Size Distribution of Best-Effort Traffic 

In this section EF one-way delay is evaluated in the presence of different
background BA packet size distributions: deterministic and real (i.e. background
traffic generated according to a real packet size distribution determined
through the monitoring of production traffic).

The test characteristics are summarized in Table 1. In order to reduce the
complexity of our analysis only two BAs run in parallel: a best-effort BA
composed of multiple best-effort micro-flows, each issuing data at constant
rate, and a single constant bit rate EF flow, serviced by the priority queue.



Table 1. One-way delay test parameters with different background traffic
profiles

EF traffic:
Load (Kbps) | Frame size (bytes) | Prot
300         | 128                | UDP

BE traffic:
Load (Kbps) | Frame size distribution | Prot
> 2000      | Deterministic, Real     | UDP



5.1 Constant Packet Size 

To begin with, an ideal test scenario has been considered in which the BE
traffic consists of multiple streams, each issuing packets of the same fixed
payload length in the range [100, 1450] bytes. This case study shows the
importance of the background packet size in the presence of transmission
queue-based systems (Figure 2), in which the queue space is allocated in units
of fixed length. This length is equal to 512 bytes in this paper. The EF packet
always spans a single memory unit, while a BE packet can occupy one or more
units depending on its size.

One-way delay is made of several components: 

1. PQ waiting time: the time which elapses between arrival time and the time
the packet moves from the first queueing stage to the FIFO transmission
queue. With PQ the maximum waiting time corresponds to the transmission
time of an MTU best-effort packet.

2. FIFO queueing time: the time a packet waits until it is dequeued from the
FIFO transmission queue. It depends on the maximum length of the transmission
queue, which in our test scenario is equal to 7 memory units of 512 bytes.

3. Transmission time: the time needed to place the packet on the wire; it is a
function of the line speed (2 Mbps) and of the packet size. The end-to-end
transmission time is additive and depends on the number of hops.
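
The three components listed above can be turned into rough per-hop upper bounds.
The following is a minimal sketch, assuming a line rate of exactly 2,000,000
bit/s, a 1500-byte MTU (an assumption, since the MTU is not stated here), no
header or layer-2 overhead, and a transmission queue whose 7 units are all
full. The function names are illustrative only.

```python
LINE_RATE_BPS = 2_000_000   # line speed used in the tests (2 Mbps), assumed exact
UNIT_BYTES = 512            # transmission queue memory unit size
TX_QUEUE_UNITS = 7          # transmission queue length in memory units
MTU_BYTES = 1500            # assumption: the MTU is not given in the text


def transmission_time_ms(size_bytes, rate_bps=LINE_RATE_BPS):
    """Time needed to place size_bytes on the wire, in milliseconds."""
    return size_bytes * 8 * 1000 / rate_bps


def worst_case_components_ms(ef_bytes):
    """Rough upper bounds for the three delay components at one hop."""
    return {
        # PQ waiting time: one MTU-sized best-effort packet in transmission
        "pq_wait": transmission_time_ms(MTU_BYTES),
        # FIFO queueing time: every memory unit of the queue completely full
        "fifo_wait": transmission_time_ms(TX_QUEUE_UNITS * UNIT_BYTES),
        # Transmission time of the EF packet itself
        "transmission": transmission_time_ms(ef_bytes),
    }


bounds = worst_case_components_ms(128)
print(bounds)
```

Since overhead is ignored, the numbers are only indicative of the order of
magnitude of each component.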

As Figure 3 shows, the one-way delay sample mean curve is non-monotone: the
discontinuity points correspond to specific BE payload sizes, namely 485 and
1000 bytes, i.e. to integer multiples of the transmission queue memory unit
size configured in this test (512 bytes).



[Figure 3: plot of EF mean one-way delay vs. BE IP payload size]

Fig. 3. One-way delay sample mean for different BE payload packet sizes



The overall increase in one-way delay is due to the small EF rate (15% of
the line capacity) in comparison with the BE rate. When an EF packet arrives
in the priority queue while a BE packet is under transmission, the EF waiting
time varies with the BE packet length, since the transmission queue holds a
mix of BE and EF packets. In any case, if a single memory unit is allocated to
best-effort traffic and the BE packet size increases, the average queueing
delay introduced by PQ increases, with a maximum when a BE datagram completely
fills the memory unit.

However, if the packet size is such that a single memory unit is not sufficient
to store it, an additional, partially filled unit has to be allocated to store
the remaining part of the BE packet. As a consequence, queue memory gets
fragmented, as completely filled units are followed in the queue by partially
filled ones. The time needed to empty the transmission queue is shorter and the
average EF queueing delay introduced by the transmission queue decreases. After
a minimum in the curve, i.e. when the BE packet length increases further, the
second memory unit fills up progressively and the fraction of queue space
allocated to BE packets becomes greater, with a consequent increase in delay.
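
The fragmentation effect described above can be illustrated with a deliberately
simplified model: a transmission queue of 7 units of 512 bytes that holds only
equal-sized BE packets, with sizes taken as stored sizes (header overhead is
ignored, so the thresholds fall at 512 and 1024 bytes rather than at the
payload sizes reported for Figure 3). This is a sketch under those assumptions,
not the model of [7]; the function names are hypothetical.

```python
import math

UNIT_BYTES = 512            # memory unit size configured in the test
QUEUE_UNITS = 7             # transmission queue length in memory units
LINE_RATE_BPS = 2_000_000   # 2 Mbps line, assumed exact


def units_needed(packet_bytes, unit_bytes=UNIT_BYTES):
    """Memory units occupied by one packet (the last one may be partly empty)."""
    return math.ceil(packet_bytes / unit_bytes)


def drain_time_ms(packet_bytes, queue_units=QUEUE_UNITS):
    """Time to empty a queue packed with equal-sized BE packets, in ms."""
    packets = queue_units // units_needed(packet_bytes)
    return packets * packet_bytes * 8 * 1000 / LINE_RATE_BPS


# Just below and just above a unit boundary the drain time drops sharply.
for size in (256, 512, 513, 1024, 1025):
    print(size, units_needed(size), round(drain_time_ms(size), 3))
```

In this toy model the drain time grows with the packet size until a unit is
completely filled, drops when one more byte forces a second unit to be
allocated, and then grows again, which is the qualitative shape discussed in
the text.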

A worst-case evaluation of the average EF one-way delay has been performed
using the general model for priority queueing presented in [7]. Assuming that
all units in the transmission queue are completely full, the local maxima of
Figure 3 are validated by the delay formula presented in [7].

5.2 Real BE Packet Size Distribution 

While in the previous section the analysis focuses on statistical distributions of 
the BE packet size, in this section performance is estimated when the BE packet 
size is modelled according to the real cumulative distribution plotted in Figure 4. 
In what follows we call it the real distribution. 

We computed this frequency distribution over a population of more than 100
billion packets belonging to traffic exchanged in both directions on an
intercontinental connection. For the following test scenarios it was chosen as
the reference distribution, so that packet size is modelled according to the
pattern of traffic exchanged at typical potential congestion points.

In this study, packet size is the only traffic parameter which is considered for real 
traffic emulation, since we focus on the worst-case scenario in which the queueing 
system under analysis described in section 3 is assumed to be under permanent 
congestion. As such, rate variability and autocorrelation of best-effort traffic are 
not considered, even if they can influence the performance of the system when 
deployed in production. 




Fig. 4. Real cumulative BE packet size distribution 



Figure 5 plots the complementary probability distribution we derived from the
frequency distribution computed experimentally during the test session, i.e. the
probability P(d > D) that the delay d experienced by a packet is greater than a
given value D. We express the delay variable in transmission units, i.e. in
integer multiples of the transmission time of an EF packet of known size at a
given line rate (for an EF payload size of 128 bytes it corresponds to 0.636
msec). Figure 5 shows that in the system under analysis the probability that
one-way delay is greater than 36 transmission units can be assumed to be
negligible. This threshold can be adopted as an upper bound of the playout
buffer size used by interactive multimedia applications and is useful for the
optimization of memory utilization.
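
As an illustration of how such a complementary distribution and a playout
bound can be obtained from delay samples, here is a minimal sketch. The
0.636 msec transmission unit comes from the text; the sample values, the
`epsilon` threshold that stands in for "negligible" and the function names are
assumptions made for this example.

```python
TX_UNIT_MS = 0.636  # transmission time of the reference EF packet (from the text)


def complementary_distribution(delays_ms, max_units):
    """Return [P(d > D) for D = 0..max_units], with D in transmission units."""
    n = len(delays_ms)
    units = [d / TX_UNIT_MS for d in delays_ms]
    return [sum(1 for u in units if u > D) / n for D in range(max_units + 1)]


def playout_bound_units(delays_ms, max_units, epsilon=1e-3):
    """Smallest D whose exceedance probability is at most epsilon, or None."""
    for D, p in enumerate(complementary_distribution(delays_ms, max_units)):
        if p <= epsilon:
            return D
    return None


# Made-up one-way delay samples in msec, for illustration only.
samples = [5.0, 6.4, 7.1, 9.8, 12.7, 15.3, 18.9, 20.0]
ccdf = complementary_distribution(samples, 40)
print(playout_bound_units(samples, 40))
```

With a real measurement the population would be much larger (the paper uses
10000 samples per distribution), so that small exceedance probabilities can be
estimated meaningfully.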




Fig. 5. EF complementary delay probability for a real BE packet size
distribution



6 Performance with EF Aggregation 

Aggregation is one of the fundamental properties which characterize the differ- 
entiated services architecture and we want to estimate its impact to verify in 
which cases and to which extent the differentiated services can provide effec- 
tive end-to-end services as opposed to the integrated services, which can provide 
per-flow performance guarantees through signalling. 

In the previous scenarios PQ performance was analysed under the assumption 
that at any time its occupancy is not greater than one packet (represented by the 
datagram waiting for the end of the current BE transmission). This hypothesis 
holds only when input traffic is evenly shaped. However, in presence of bursty 
traffic the nodal delay introduced by the priority queue becomes non-negligible. 
Burstiness can stem from traffic aggregation, i.e. from packet clustering, which 
occurs when packets of the same BA arrive at around the same time from differ- 
ent input interfaces and are destined to the same output interface. Even a single 
source injecting multiple streams can be a potential source of burstiness since 
independent streams are not synchronized and they can produce instantaneous 
bursts of packets, even if each single stream is perfectly shaped. 

Test results confirm that in this case the priority queue can instantaneously
hold two or more packets, and performance depends on the percentage of traffic
injected by a given source and on its size. Results show that in the absence of
shaping and policing the effect of aggregation can propagate through a wide
area network, increasing the burst size step by step in an avalanche fashion,
as also confirmed by the simulation results presented in [11].

In this test scenario aggregation is produced by injecting several EF streams
from different interfaces.

The reference stream load decreases from 300 Kbps (when only the EF reference
stream is present) to 50 Kbps (when several EF streams are run in parallel).
The test conditions are summarized in Table 2. We introduce a new metric, the
Aggregation Degree (which we denote by the letter A), to quantify the amount of
bandwidth shared among different streams within a BA: $A = 1 - l_{max}/L_{BA}$,
where $l_{max}$ is the maximum load of a micro-flow and $L_{BA}$ is the overall
BA load (i.e. the total traffic volume generated by the streams belonging to
the class). A is equal to 0 if just a single stream is active, while it is
close to 1 when the BA load is divided among many tiny flows.
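
The definition of A can be checked on a few load partitions. The sketch below
uses a constant BA load of 300 Kbps split between one reference stream and
five competing streams; the equal split of the competing load among the five
streams is an assumption made for this example, as the text does not state how
that load is divided.

```python
def aggregation_degree(flow_loads_kbps):
    """A = 1 - l_max / L_BA for the micro-flows of one behaviour aggregate."""
    total = sum(flow_loads_kbps)
    if total <= 0:
        raise ValueError("the BA load must be positive")
    return 1.0 - max(flow_loads_kbps) / total


# Reference stream plus five competing streams (equal split is assumed).
scenarios = {
    "1 x 300": [300],
    "1 x 200 + 5 x 20": [200] + [20] * 5,
    "1 x 100 + 5 x 40": [100] + [40] * 5,
    "1 x 50 + 5 x 50": [50] + [50] * 5,
}
for name, loads in scenarios.items():
    print(name, round(aggregation_degree(loads), 3))
```

A single stream gives A = 0, and A grows as the largest micro-flow carries a
smaller share of the aggregate.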

In this test scenario a decreasing load injected by the reference stream
indicates that A increases: competing EF streams consequently issue a greater
amount of traffic, which implies a higher packet clustering probability.
BE traffic is modelled according to the real distribution described in
Section 5.2.



Table 2. Test parameters for IPDV under different aggregation patterns

EF traffic (UDP):
BA load (Kbps) | Number of streams | Ref. stream load | Ref. stream frame size
300            | variable          | [50, 300] Kbps   | 128 bytes

BE traffic (UDP):
BA load (Kbps) | Number of streams | Frame size distribution
> 2000         | 20                | real, [0, 1500]



6.1 One-Way Delay 

Delay distributions measured in the presence of multiple EF flows but with
constant BA load show that when the aggregation degree A increases, the delay
distribution becomes more spread out, giving rise to greater delay variability
and larger average one-way delay values. In particular, for the four values of
A tested, taken in decreasing order down to A = 0, the average delay is equal
to 18.29, 17.66, 17.23 and 16.75 msec respectively.

Thus, we can conclude that the aggregation degree is also a design parameter
that needs to be upper bounded in order to achieve acceptable performance.
This is important during the admission control phase.

In Figure 6 the complementary probability distributions derived from the 
above-mentioned frequency distributions are plotted. The graphs show that for 
a given delay D the probability significantly varies with the aggregation degree A. 
If stringent design requirements need to be met, then the number of flows must 
be bounded through admission control. 

The impact of the EF stream on burstiness depends on the number of hops,
as explained in [12]. In fact, the presence of a single aggregation point limits
the burst accumulation phenomenon, which is visible in multi-hop data paths
with a chain of aggregation and congestion points. In addition, the presence of
a short EF queue (10 packets in this test scenario) limits the maximum
instantaneous burst to 1280 bytes, which corresponds to a maximum queueing
delay of 5.12 msec.
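
The two figures quoted above follow directly from the queue size, assuming
EF packets of 128 bytes and a line rate of exactly 2 Mbps (both values taken
from the test description):

```latex
\[
B_{\max} = 10 \times 128\ \text{bytes} = 1280\ \text{bytes}, \qquad
d_{\max} = \frac{8 \times 1280\ \text{bit}}{2 \times 10^{6}\ \text{bit/s}}
         = 5.12\ \text{msec}
\]
```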




[Figure 6: curves for 1 EF stream 50 Kbps + 5 EF streams 250 Kbps; 1 EF stream
100 Kbps + 5 EF streams 200 Kbps; 1 EF stream 200 Kbps + 5 EF streams 100 Kbps;
1 EF stream 300 Kbps]

Fig. 6. Complementary one-way delay probability functions for different
Aggregation Degrees



6.2 IPDV 

IPDV frequency distribution curves were calculated for different aggregation 
patterns with a reference stream rate decreasing from 300 Kbps to 50 Kbps and 
a constant aggregate EF load equal to 300 Kbps. 
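
As a sketch of how an IPDV frequency distribution can be derived from
measured one-way delays, the example below takes IPDV as the difference
between the one-way delays of consecutive packets of the reference stream
(the usual IETF IPPM notion). The bin width, the sample values and the use of
absolute values for the average are assumptions of this example; the paper
does not state them here.

```python
from collections import Counter
from statistics import mean, pstdev


def ipdv_samples(delays_us):
    """Delay differences between consecutive packets of one stream."""
    return [b - a for a, b in zip(delays_us, delays_us[1:])]


def frequency_distribution(values, bin_us):
    """Relative frequency of values per bin of width bin_us (keyed by bin start)."""
    counts = Counter((v // bin_us) * bin_us for v in values)
    return {start: c / len(values) for start, c in sorted(counts.items())}


# Made-up one-way delays in microseconds, for illustration only.
delays = [16000, 16600, 15900, 18200, 16100, 16100, 17500]
ipdv = ipdv_samples(delays)
dist = frequency_distribution(ipdv, 1000)
print(dist)
print(mean(abs(v) for v in ipdv), pstdev(ipdv))
```

The same samples also give the standard deviation, which is the statistic
used later to compare different EF frame sizes.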

The distributions (*) present two peaks, which are presumably due to the real
distribution of the BE packet size, which is mainly concentrated around two
packet size values.

In this test the transmission time of a reference packet is 0.636 msec and the
maximum IPDV was 30 times the transmission time of the packet itself,
independently of the aggregation pattern. However, for a higher value of A IPDV
decreases more rapidly and is more densely distributed around a small value.
High aggregation degrees produce an increase in IPDV: for three values of the
aggregation degree A, taken in decreasing order down to A = 0, the average IPDV
value is equal to 6.05, 5.16 and 3.63 msec respectively.

(*) For more information about these distributions and graphs we refer to the
long version of this paper.

7 EF Packet Size

IPDV increases with the EF packet size. In this test only one EF stream is
generated (A is equal to 0) and used as reference stream. The frame size (i.e.
the EF IP packet size plus the layer 2 overhead) is constant for a given test
and takes the values 128, 512 and 1024 bytes, as indicated in Table 3. The IPDV
frequency distribution is better, i.e. more concentrated, for smaller EF frame
sizes, as illustrated by the IPDV frequency distribution curves in Figure 7:
for example, with a frame size equal to 128 bytes, 60% of the IPDV values are
concentrated in a single interval.

Table 3. Test parameters for IPDV with different EF frame sizes

EF traffic (UDP):
BA load (Kbps) | Number of streams | Ref. stream load | Ref. stream frame size
300            | 1                 | 300 Kbps         | 128, 512, 1024 bytes

BE traffic (UDP):
BA load (Kbps) | Number of streams | Frame size distribution
> 2000         | 20                | real, [0, 1500]

The standard deviation increases with the packet size: for packets of 128, 512
and 1024 bytes it takes the values 2016.9, 2500.3 and 3057.0 respectively.




[Figure 7: x-axis: IPDV bins (microsec); curves for EF frame sizes of 128, 512
and 1024 bytes]

Fig. 7. IPDV frequency distribution vs EF frame sizes (BE size distribution:
real)



8 Conclusions and Future Work 

This paper provides an in-depth measurement-based analysis of the Priority
Queueing algorithm when adopted for the support of end-to-end QoS for delay-
and jitter-sensitive applications, which require EF-based services. Experimental
results show that the end-to-end performance of the queueing system is strongly
related to the system components (for example the additional FCFS buffering
stage, i.e. the transmission queue) as well as to traffic-related factors like
the traffic pattern of active BAs and the traffic profile of the EF stream
itself. While Priority Queueing proved to be one of the most effective
algorithms to minimize queueing delay, a careful system design should be
adopted in order to provide strict one-way delay and IPDV guarantees.

The transmission queue introduces an additional contribution to the nodal
delay produced by the priority queue, a delay that is proportional to the average
packet size in the background BAs. As such, the transmission queue size should
be limited.

The one-way delay distribution of EF traffic is strongly dependent on the
background traffic distribution, and in this study we have compared two cases:
the uniform and the real distribution. Generally speaking, for larger
background traffic packet sizes the one-way delay standard deviation increases,
as values are spread over a larger range. For a given distribution and a given
EF packet size the complementary delay probability can be computed for delay
estimation in complex network scenarios and for system design purposes.

Background traffic profile is not the only relevant factor: both the flow and
the BA profile can impact performance.

Firstly, the IPDV standard deviation of a flow increases with its average
packet size. Secondly, stream aggregation within a class has an influence on
both the end-to-end one-way delay and the IPDV experienced by each flow. In
this paper we define the Aggregation Degree parameter to describe the load
partitioning between flows within a class and we express performance as a
function of it. A high aggregation degree produces an increase in average
one-way delay and IPDV. As a consequence, in case of stringent delay and jitter
requirements, aggregation has to be limited. The dimensioning of aggregation is
the subject of future research: the effect of the number of streams, of the
nominal rate and of packet size will be investigated.

Abstract. The Internet is developing from a data network supporting
best-effort service to a network offering several classes of service. An
operator of an IP-network should monitor the achieved QoS for each
class using suitable measurement points and measurement methods. The
paper briefly compares the IPPM and I.380 IP-level QoS (Quality of
Service) parameter definitions and points out their problems: the selection
of the measurement points and the use of test traffic based measurements.
The paper gives more suitable definitions of QoS and GOS (Grade of
Service) parameters for the IP level, discusses measurement issues of loss
and transfer rate, and proposes and illustrates via an example a network
management based QoS-monitoring system.

1 Introduction 

Building QoS in the Internet is one of the topics of current interest. New
traffic management methods in the Internet, namely the Integrated Services
[2,3,7], the Differentiated Services [4,5,6,7] and other QoS-architectures [1],
may enable the provisioning of several classes of service with different
QoS-levels. The methods combine classification, priority mechanisms and
reservation of bandwidth. There is currently much work on the characterization
of traffic in the Internet, like [8], which is useful in the classification and
management of traffic. However, the QoS-management methods require supporting
methods, e.g. for QoS-monitoring, before operators can build new services on
them. There are new definitions of QoS-parameters for monitoring purposes
[9-13] and projects where traffic management and QoS-monitoring are developed
[14-15,20]. The recommendations of ITU-T for QoS/GOS-monitoring [16,18,19] may
also be relevant for IP-traffic. This paper, partly done for [20]*, partly in
internal projects, discusses QoS-monitoring issues in IP-networks.



* The participants in the EURESCOM-project Quasimodo are CSELT, Deutsche Telecom, 
HTE, OTE, Portugal Telecom, Broadcom, British Telecom, Telecom Sweden, Telenor and 
the Finnet Group 

J. Crowcroft, J. Roberts, and M. Smirnov (Eds.): QofIS 2000, LNCS 1922, pp. 182-193, 2000.

© Springer-Verlag Berlin Heidelberg 2000

2 Measurement Reference Points

The subscriber access to the Internet can be achieved through various access
methods whose traffic characteristics can be drastically different. Figure 1
illustrates some reference connections, which differ in traffic aspects and
should be considered while developing QoS-measurement methods. The first
reference connection is a switched connection, typically through a low bit-rate
end-user access (e.g. ISDN). In this access method the capacity and
characteristics of the link have an effect on the performance and there is
blocking and queuing in the Point of Presence. In the second reference
connection the end-user has a wireless access (e.g. GPRS). The mobile network
has a major influence on the user-perceived QoS. The last reference connection
shows intranet connections over public lines (e.g. IP-VPN). In Figure 1 the
measurement reference points (MRPs) are marked with P. MRPs are situated in one
ISP's (Internet Service Provider's) network, not in the end systems as in the
IETF IPPM definitions. This requires changes in the definitions of QoS/GOS
parameters and results in different measurement methods. The selection of MRPs
agrees with the EURESCOM-project Quasimodo definitions [20], but [20] only
considers LAN access to the network.




Figure 1. Measurement points in a real IP-network 



3 Comparison of Existing QoS/GOS Parameter Definitions 

This section outlines the differences between the available standardized IP QoS
parameter definitions. Currently there exist two sets of QoS parameter
definitions: one made by IETF IPPM and another by ITU-T. Table 1 compares these
two approaches.

IETF has created a framework for IP QoS metrics [9] and has also defined
measurement methods for delay [10], delay variation (jitter) [11] and loss
[12]. In [10,11] the IP traffic is divided into pairs of packets and streams,
for which separate definitions (either pair of measurement packets or stream
traffic) are given for measuring one-way delay and delay variation between two
measurement points. In [10] one-way IP packet loss metrics are also defined. In
[12] two loss-related metrics are defined that try to identify loss pattern
characteristics. ITU-T has issued Recommendation I.380 [13], which defines
parameters that can be used to measure the performance of speed, accuracy,
dependability and availability of IP packet transfer. Here IP service
performance is considered in the context of the ITU-T 3x3 performance matrix
defined in ITU-T Recommendation I.350. ITU-T has set IP packet transfer
parameters (Table 1). Both IPPM and I.380 propose measurements of QoS by test
traffic. IPPM has the measurement points at the end systems, while with I.380
the measurement points can be at the MRPs of Figure 1. Measuring QoS in
Figure 1 between two MRPs (P = B and P = C) by test traffic means injecting
test packets of a given QoS-class with a predetermined arrival process at MRP
B, with a destination address such that the route passes MRP C. Values of
QoS-parameters are estimated from the QoS of the test packets.



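Table 1: IPPM vs. ITU-T I.380

QoS/GOS metric        | IPPM                                | ITU-T I.380
----------------------+-------------------------------------+----------------------------------------
Delay                 | One-way-delay                       | IP Packet Transfer one-way Delay
                      | One-way-delay-stream                | Mean IPTD (one-way)
----------------------+-------------------------------------+----------------------------------------
Delay variation       | One-way-ipdv                        | IP packet delay variation (one-way),
                      | One-way-ipdv-stream                 | end-to-end 2-point:
                      |                                     |  - Average delay
                      |                                     |  - Interval-based limits
                      |                                     |  - Quantile-based limits
----------------------+-------------------------------------+----------------------------------------
Loss                  | One-way-packet-loss                 | IP packet loss ratio (IPLR)
                      | One-way-packet-loss-stream          |
                      | One-way-packet-loss-distance-stream |
                      | One-way-packet-loss-period-stream   |
----------------------+-------------------------------------+----------------------------------------
Transfer rate related | No                                  | IP packet throughput (IPPT)
                      |                                     | Octet-based IP packet throughput (IPOT)
                      |                                     | Destination limited source
                      |                                     | Throughput probe
----------------------+-------------------------------------+----------------------------------------
Service Availability  | No                                  | Yes, related to IPLR
----------------------+-------------------------------------+----------------------------------------
Others                | No                                  | Spurious IP packet rate
----------------------+-------------------------------------+----------------------------------------
Blocking              | No                                  | Defined in ITU-T E.493
----------------------+-------------------------------------+----------------------------------------
Set-Up delay(s)       | No                                  | Defined in ITU-T E.493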

Measurement by test traffic can give a wrong value because, for example, test
packets may be treated differently by nodes, test packets may take a different
route, or the test traffic profile may differ from the user traffic profile. The
last problem is the most difficult, because the user traffic is typically very
bursty while test traffic is much less bursty. The proportion of test traffic is
small compared to user traffic. Therefore the low QoS time slots are those in
which the number of user packets is high. The proportion of test packets during
those low QoS slots is smaller than on average, so the low QoS slots get a
smaller weight in the measurement by test traffic. Clearly, test traffic
measures too high a QoS.

One possibility to avoid this problem is to ignore all but high traffic times.
There is a standard method proposed by ITU-T in Recommendation E.492 [19]. This
recommendation describes a way to select the high traffic load times when
QoS/GOS-parameter values should be evaluated. In [19] one measurement value for
traffic load is obtained by counting the traffic volume over a read-out time
(ROT). Measurement over one read-out time does not give sufficient statistics,
therefore n read-out times are concatenated, making a measurement period of
n × read-out time = reference period length (RPL). The measurement value of the
traffic load for the time period is the average of the n measurement instants.
The time axis is divided into periods of length RPL and the measurement is
repeated in each time period. A reference period is a selected time period of
length RPL. The selection procedure is as follows. For each working day, only
the measured values for the highest traffic load period are kept. The total
measurement is continued for 20 working days in a sliding window manner. Out of
the 20 values the period giving the k-th highest traffic load is selected as
the reference period and the QoS/GOS-parameter value is evaluated for this
period only. E.492 proposes k = 2 for high load and k = 4 for normal load
reference periods. Let us see what fractile of the traffic this method gives.
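
The selection procedure just described can be sketched as follows. This is an
illustrative reading of the text, not an implementation of Recommendation
E.492 itself: the data layout (one list of read-out-time volumes per working
day), the function names and the tiny three-day data set are assumptions, and
the sliding 20-day window is not modelled.

```python
def period_loads(rot_volumes, n):
    """Average n consecutive read-out-time volumes into one RPL load value."""
    return [sum(rot_volumes[i:i + n]) / n
            for i in range(0, len(rot_volumes) - n + 1, n)]


def reference_period(daily_rot_volumes, n, k):
    """Return (day, period, load) of the k-th highest daily peak period."""
    peaks = []
    for day, volumes in enumerate(daily_rot_volumes):
        loads = period_loads(volumes, n)
        # Keep only the highest traffic load period of each working day.
        period = max(range(len(loads)), key=loads.__getitem__)
        peaks.append((day, period, loads[period]))
    peaks.sort(key=lambda p: p[2], reverse=True)
    return peaks[k - 1]


# Three days of made-up volumes, four read-out times per day, n = 2.
days = [[1, 3, 5, 7], [2, 2, 9, 9], [4, 4, 1, 1]]
print(reference_period(days, n=2, k=2))
```

In the procedure of the text the list would hold 20 working days, and the
QoS/GOS-parameter would then be evaluated only for the returned period.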

The probability that the k-th largest value in 20 trials is higher than the
fractile $f_q$ is

$$g(q,k) = 1 - \sum_{i=0}^{k-1} \binom{20}{i} q^{20-i} (1-q)^{i} . \qquad (1)$$

Then $g(q,2) = 1 - 20q^{19} + 19q^{20}$ and

$$g(q,4) = 1 - 1140q^{17} + 3230q^{18} - 3060q^{19} + 969q^{20} .$$

Solving $g(q,k) = 0.5$ would give the median, but let us evaluate the most
likely value. Differentiating twice w.r.t. q and setting the second derivative
to zero gives the most likely value, $q = (20-k)/19$. This shows that the
second highest value estimates $f_{94.7\%}$ and the fourth highest estimates
$f_{84.2\%}$ of the distribution of traffic loads in the set consisting of the
highest periods of each day. The second and the fourth highest values are very
uncertain estimates, as can be seen from the PDF obtained from (1) above.
However, if we look at the fractile of traffic that these numbers estimate from
the set of all periods in all days, it is seen that the estimated fractile is
rather constant and over 99%. This means that both the second and the fourth
highest days give a period of quite high load compared to an average load. In
order to express the fractile over the whole time axis of j periods of length
RPL a day, note that the highest RPL is below the fractile $f_p$ with
probability $p^j$. Setting the second derivative w.r.t. p to zero for
$g(p^j,k)$ and solving for p gives the fractile that the numbers estimate over
the whole time scale. For j = 24 (RPL = 1 hour), this leads to the result: the
second highest value estimates $f_{99.8\%}$ and the fourth highest estimates
$f_{99.3\%}$, implying that these definitions result in practical and easily
calculated estimates for maximum values of the QoS-parameters. It appears that
for obtaining values for maximum delay and jitter the E.492 method can be
suitable. However, k could be selected to give a lower fractile, as the
Internet does not have such good QoS. For measuring low packet loss
probabilities the method is less suitable: it ignores almost all measurement
periods and will take a long time to produce a result.
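
The fractiles discussed above can be checked numerically. The sketch below
evaluates equation (1) as a binomial sum and uses a closed form for the most
likely fractile that was worked out for this example by setting the second
derivative of $g(p^j,k)$ to zero, which gives
$p^j = (j(21-k)-1)/(20j-1)$; it is not quoted from the paper, and it relies
on the same independence assumption between periods as the text.

```python
from math import comb

N_DAYS = 20  # number of working days in the measurement window


def g(q, k, n=N_DAYS):
    """P(k-th largest of n daily peaks exceeds the fractile f_q), equation (1)."""
    return 1.0 - sum(comb(n, i) * q ** (n - i) * (1 - q) ** i for i in range(k))


def most_likely_fractile(k, j=1, n=N_DAYS):
    """Mode of the estimated fractile; j = periods of length RPL per day."""
    q = (j * (n + 1 - k) - 1) / (n * j - 1)
    return q ** (1.0 / j)


for k in (2, 4):
    print(k, round(most_likely_fractile(k), 3),
          round(most_likely_fractile(k, j=24), 3))
```

For j = 1 the function reduces to (20 - k)/19, the fractile within the set of
daily peak periods; for j = 24 it gives the fractile over all one-hour periods.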



4 Definitions of QoS/GOS-Parameters

The following list of QoS/GOS-parameters seems sufficient for fixed IP-networks:
delay, jitter, packet loss, connection blocking, some connection set-up times
when relevant, and transfer rate. These parameters are measured from the MRPs
and therefore are not directly the parameters visible to the end user; in [20]
they are called Network Performance Parameters. Transfer rate is a parameter
related to throughput and it is usually included in network performance
parameters.

Evaluating the parameters for each connection rather than per CoS (Class of
Service) type and MRP pair makes sense only if users of a class can
systematically obtain a different QoS than the average for a CoS type. This can
happen if traffic management uses methods (e.g. traffic shaping) which control
the QoS so that a connection systematically has a better value for a QoS
parameter than what is obtained by counting all connections together, or if the
traffic characteristics of the users (e.g. a bursty connection) influence the
QoS. In the following definitions QoS/GOS-parameters are not measured per
connection but per CoS type and per pair of MRPs, because there is more
statistical variation in the values measured per connection and it is difficult
to commit to target values for these measurements.

If an IP-packet passes two MRPs, B and C in this order, the difference between
the arrival time in C and the arrival time in B gives an instance of the
stochastic variable one-way-packet-delay. We define the QoS-parameter delay as
the average of one-way-packet-delay instances over 15 minutes. The delay can be
measured by time stamping user traffic. IP-packets sent from B are equipped
with a time stamp from a local clock at B. At C the time stamp is read and
compared to a time stamp from a local clock. The clocks can be correlated by
clock synchronization. NTP (Network Time Protocol) can give a relatively
accurate synchronization (error from 100 to 10 ms). If better accuracy is
needed, GPS (Global Positioning System) can be used. We define a jitter
instance as three times the standard deviation of one-way-packet-delay
instances over 10 seconds. A number of such measurements are made over the time
period of 15 minutes and the QoS-parameter jitter is the average of the jitter
instances. The jitter is measured in a similar way as the delay by using time
stamps. This definition of jitter is a 2-point jitter (at MRPs) calculated per
CoS-type and MRP pair.
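
The delay and jitter definitions above translate into a short computation
over the time-stamped packets of one 15-minute period. The following is a
minimal sketch: the input layout, the use of the population standard
deviation, and the rule that a 10-second window with fewer than two packets
yields no jitter instance are assumptions of this example, not part of the
definitions.

```python
from statistics import mean, pstdev

JITTER_WINDOW_S = 10  # length of the window for one jitter instance


def delay_and_jitter(samples):
    """samples: (arrival time at B in s, one-way delay in ms) for one period.

    Returns (delay, jitter) in ms for one CoS type and one MRP pair.
    """
    if not samples:
        raise ValueError("no packets observed in the period")
    delay = mean(d for _, d in samples)
    start = min(t for t, _ in samples)
    windows = {}
    for t, d in samples:
        windows.setdefault(int((t - start) // JITTER_WINDOW_S), []).append(d)
    # Assumption: windows with fewer than two packets give no jitter instance.
    instances = [3 * pstdev(ds) for ds in windows.values() if len(ds) >= 2]
    jitter = mean(instances) if instances else 0.0
    return delay, jitter


# Made-up samples: (time at B in seconds, delay C - B in milliseconds).
obs = [(0.0, 20.0), (1.0, 22.0), (2.0, 24.0), (11.0, 30.0), (12.0, 30.0)]
delay, jitter = delay_and_jitter(obs)
print(delay, jitter)
```

The delay values fed to such a function are only as accurate as the clock
synchronization between B and C.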
The QoS-parameter loss could be defined as the ratio of lost packets to all packets
passing MRP B and destined to an address whose route passes MRP C. The loss would be
counted separately for each direction, B → C and C → B. A read-out time (ROT) should
be chosen over which the loss ratio is counted. Packets that are sent from B to C but
are not received at C within a selected threshold time are counted as lost. The
threshold should be large enough that packets with high jitter are not counted as
lost; a small threshold would couple the QoS-parameters jitter and loss. Loss defined
in this way is difficult to measure using the MRPs of Fig. 1. Measuring the loss
probability between two MRPs B and C from user traffic is also complicated: deciding
that a packet has been lost between two measurement points requires some way of
knowing that it has not arrived in time. Loss could also be measured by counting the
packets lost in test traffic. However, test traffic does not have the same traffic
profile as user traffic. The rate of the test traffic must be chosen so that it does
not generate too much load but still gives sufficient statistics for low target loss
ratios. These measurement problems have led us to define the loss differently. The
QoS-parameter loss is the ratio of packets lost in the network to the total number of
packets sent. It is counted by summing transmitted and discarded packets in each
node. With this definition, loss is a network-wide parameter. Loss is measured so that
each network node calculates two numbers for a read-out period, individually for each
CoS type: the number of IP packets discarded in the node during ROT and the number of
packets arriving in the node during ROT. These numbers are summed over all nodes to
get network-wide numbers. The loss is obtained by dividing the total number of IP
packets discarded during ROT by the total number of IP packets arriving during ROT.
Some useful numbers for this measurement can be obtained from SNMP MIBs (Management
Information Bases) in the routers.
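
The network-wide loss computation described above can be sketched as follows. This is only a minimal illustration: the class and function names are hypothetical, and the counters are assumed to have been read already (for example from router MIBs) for one CoS type and one read-out period.

```python
from dataclasses import dataclass


@dataclass
class NodeCounters:
    """Counters of one node for one CoS type over one read-out period (ROT)."""
    arrived: int    # IP packets arriving in the node during ROT
    discarded: int  # IP packets discarded in the node during ROT


def network_loss(nodes):
    """Network-wide loss: total discarded packets divided by total arrivals."""
    total_arrived = sum(node.arrived for node in nodes)
    total_discarded = sum(node.discarded for node in nodes)
    if total_arrived == 0:
        return 0.0  # no traffic observed, so no loss can be reported
    return total_discarded / total_arrived


# Illustrative counters for three nodes (made-up numbers).
sample_nodes = [
    NodeCounters(arrived=10_000, discarded=5),
    NodeCounters(arrived=20_000, discarded=0),
    NodeCounters(arrived=10_000, discarded=15),
]
sample_loss = network_loss(sample_nodes)
```

Because the counters are summed before dividing, heavily loaded nodes weigh more in the result than lightly loaded ones, which matches the definition of loss as a network-wide parameter.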

If capacity reservation is used, some connections will not be accepted due to
insufficient network resources. This may be defined as connection blocking. Blocking
should be taken as a QoS-parameter for capacity reservation schemes (e.g. RSVP),
since if high blocking is allowed, the connection QoS-parameters (delay, jitter, and
loss) of accepted connections can have very good values while user satisfaction is
poor. The definition of connection blocking is rather complicated if the user can
request any bandwidth. Not all refused resource reservations are blocking in the QoS
sense: if a host has no resources, this is not a network problem. If a router has no
resources, this may be a network problem (e.g. if the user tries to reserve resources
for a video stream and gets through only a reservation with a bit rate below 64 kbps)
or not (e.g. an unreasonable user demand). One possibility is that the RSVP PATH
message is sent at the network point B (not from the end user) and, depending on the
CoS class, carries a given minimum bandwidth request. Another, probably better,
method is to have the reservations ready, so that the user's packets are only marked
for the correct flow. The MRP B decides whether to accept a new user connection. A
similar approach is presented in the IETF MPLS WG Internet drafts [21,22], which
describe e.g. RSVP extensions for establishing label-switched paths (automatic
routing away from e.g. network failures, congestion, and bottlenecks).

The transfer rate parameter is a compromise that tries to give a value closely
connected with the response time for interactive data traffic. The response time is
influenced mostly by the throughput, not so much by the end-to-end delay. Throughput
refers to the maximum throughput and can only be measured by loading the network with
test traffic up to maximum load. A simple test in which an FTP transfer is made
between two MRPs should indicate whether the network throughput is low and whether
the user will experience long response times. The transfer rate between MRPs A and B
is estimated as file size / transfer time.



5 Relation between the Transfer Rate, Loss and Delay 

The relation between the transfer rate, the delay and the loss is interesting because
the transfer rate is most easily measured with test traffic, but sending test traffic
continuously for transfer rate measurement creates a heavy load. One possible
solution is to find a mathematical relation between the transfer rate of the test
traffic and its delay and loss, and to calibrate the parameters. Consider first the
case where the test traffic source is not capable of filling the network and there is
not enough other traffic to fill it: the source bit rate r is smaller than the
network bit rate R. The source increases the window size to the limit imposed by the
receiver window size. The source transfer rate is determined by the relation between
the TCP receiver window size K and 2a, where

\[ 2a = \frac{RTT \cdot r}{B} \tag{2} \]

RTT is the round trip time and B is the (fixed) segment size. The transfer rate V is
given by the classical formula for Go-back-n ARQ without losses:

\[ V = r \quad \text{for } K \ge 2a + 1, \qquad V = \frac{K r}{1 + 2a} \quad \text{for } K < 2a + 1 . \]
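
The loss-free Go-back-n relation above can be illustrated with a short sketch. The function name and the units (bit/s for r, seconds for RTT, bits for the segment size B) are assumptions made for this example only.

```python
def lossless_transfer_rate(r, rtt, segment_bits, window):
    """Transfer rate V of Go-back-n ARQ without losses.

    r            source bit rate in bit/s
    rtt          round trip time in seconds
    segment_bits fixed segment size B in bits
    window       receiver window size K in segments
    """
    two_a = rtt * r / segment_bits  # equation (2): 2a = RTT * r / B
    if window >= two_a + 1:
        return r  # the window never stalls the source
    return window * r / (1 + two_a)  # the source waits for acknowledgements


# Illustrative values: 1 Mbit/s source, 100 ms RTT, 1000-byte segments.
rate_large_window = lossless_transfer_rate(1e6, 0.1, 8000, window=32)
rate_small_window = lossless_transfer_rate(1e6, 0.1, 8000, window=8)
```

With these numbers 2a = 12.5, so a window of 32 segments lets the source send at its full bit rate, while a window of 8 segments limits it to a fraction 8/13.5 of that rate.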

The delay influences the transfer rate only if the round trip delay grows so large
that the window size becomes insufficient. The loss probability is zero in this case.
Jitter has very little effect on the transfer rate, since TCP buffers segments in the
receiver and reorders them, despite using Go-back-n ARQ for retransmissions. TCP is
currently very robust against jitter and can calculate a good estimate of RTT even
when jitter is quite large. In the other case there is enough traffic to fill the
network. TCP calculates SRTT (smoothed round trip time) in such a way that the RTO
(retransmission timer) will not expire unless the traffic load changes too fast. If
the algorithms have time to adapt, the congestion avoidance and slow start algorithms
determine the behavior of TCP. Losses are created by buffer overflows as the TCP
sources speed up until they experience losses. Let $P_f$ denote the segment loss
probability and assume, for simplicity, that segment losses are not correlated. This
is not quite true, but it simplifies the analysis. Using RED (Random Early Detection)
makes the losses less correlated. The window size of TCP is limited by the congestion
window size. If the congestion window size is $K$, then in the time of receiving one
segment the window grows to $K + 1$, provided that there are no segment losses. The
probability of no segment loss is $(1 - P_f)$. If there is a segment loss, the window
size decreases to $K/2$. Let us denote the window size after receiving the $n$th
segment by $X_n$. With independent segment losses, the expected next window size in
slow start is

\[ E[X_{n+1} \mid X_n = K] = (1 - P_f)(K + 1) + P_f K / 2 . \]

Let us use the following steady state condition (which holds under some conditions):
$E[X_{n+1} \mid X_n = K] = K$. This gives $P_f = 2/(K + 2)$. Clearly, this algorithm
does not allow small loss rates, as the congestion window would be much larger than
the receiver window. Instead of slow start, congestion avoidance may be used. Using
that algorithm we would get $P_f = 2/(K^2 + 2)$. However, the steady state conditions
may also lead to a quasi-periodic solution, which is not analyzed further here. In
[23] there is a periodic solution, referred to as an equilibrium solution in the
literature, as in [24].

With a receiver window size of 32, congestion avoidance allows loss rates to go down
to 0.002, but below that TCP loses its ability to slow down the rate. If TCP has to
reduce the rate so much that it operates in slow start, it cannot manage with small
loss rates either. This may indicate problems in offering low-loss classes to TCP
sources. If TCP sources stay in the congestion avoidance stage, they can support
relatively low loss rates, but if they actually stayed in this stage, the throughput
would vary relatively little. If the threshold K after which congestion avoidance is
used is, say, 8, and the receiver window is, say, 32, then the throughput and the
user-experienced delays would vary between some acceptable delay D and 4D, which
would probably still be acceptable. In reality the delays in the Internet are often
unacceptable, which means that the sources enter slow start quite commonly. In the
slow start stage they require very high loss probabilities in order to control their
rates, such as 1/17 for K = 32.
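
The two steady-state loss probabilities quoted in this section can be checked numerically. The sketch below assumes the per-segment window growth of 1 in slow start and 1/K in congestion avoidance; the function names are chosen for this example only.

```python
from fractions import Fraction


def slow_start_loss(k):
    """Steady-state loss probability in slow start: 2 / (K + 2)."""
    return Fraction(2, k + 2)


def congestion_avoidance_loss(k):
    """Steady-state loss probability in congestion avoidance: 2 / (K^2 + 2)."""
    return Fraction(2, k * k + 2)


def expected_next_window(k, p_loss, growth):
    """Expected window after one segment: grow by `growth` or halve on loss."""
    return (1 - p_loss) * (k + growth) + p_loss * Fraction(k, 2)


K = 32
p_slow_start = slow_start_loss(K)
p_avoidance = congestion_avoidance_loss(K)

# In steady state the expected next window equals the current window K.
window_slow_start = expected_next_window(K, p_slow_start, growth=1)
window_avoidance = expected_next_window(K, p_avoidance, growth=Fraction(1, K))
```

For K = 32 this gives 1/17 in slow start and 1/513 (about 0.002) in congestion avoidance, in agreement with the values used in the text.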

If the segment loss rate is so large that the congestion window is smaller than the
receiver window, the transfer rate is (under steady state conditions) given by the
classical formula for Go-back-n ARQ for $K < 2a + 1$. Inserting $K = 2/P_f - 2$
yields a simple approximation for the transfer rate as a function of the loss rate
and the round-trip time for the slow start:

\[ V = r \, \frac{K (1 - P_f)}{(1 + 2a)\,(1 + P_f (K - 1))} = \frac{2 (1 - P_f)}{3} \cdot \frac{B r}{P_f (B + r \cdot RTT)} \tag{3} \]



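and for the congestion avoidance, where $K = \sqrt{2/P_f - 2}$:

\[ V = r \, \frac{K (1 - P_f)}{(1 + 2a)\,(1 + P_f (K - 1))}, \qquad 1 + 2a = \frac{B + r \cdot RTT}{B} \tag{4} \]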
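
As a rough illustration of how the transfer rate could be calculated from the loss ratio and the round trip time, the sketch below evaluates the Go-back-n relation with the steady-state window sizes of slow start and congestion avoidance. All function names are hypothetical, the units (bit/s, seconds, bits) are assumptions, and the expressions only apply while K < 2a + 1.

```python
import math


def go_back_n_rate(r, rtt, segment_bits, window, p_loss):
    """Go-back-n transfer rate with segment loss probability p_loss."""
    two_a = rtt * r / segment_bits
    return r * window * (1 - p_loss) / ((1 + two_a) * (1 + p_loss * (window - 1)))


def slow_start_rate(r, rtt, segment_bits, p_loss):
    """Insert the slow start steady-state window K = 2 / P_f - 2."""
    window = 2 / p_loss - 2
    return go_back_n_rate(r, rtt, segment_bits, window, p_loss)


def slow_start_rate_closed_form(r, rtt, segment_bits, p_loss):
    """Closed form of the same relation: (2 (1 - P_f) / 3) B r / (P_f (B + r RTT))."""
    return (2 * (1 - p_loss) / 3) * segment_bits * r / (p_loss * (segment_bits + r * rtt))


def congestion_avoidance_rate(r, rtt, segment_bits, p_loss):
    """Insert the congestion avoidance steady-state window K = sqrt(2 / P_f - 2)."""
    window = math.sqrt(2 / p_loss - 2)
    return go_back_n_rate(r, rtt, segment_bits, window, p_loss)


# Illustrative values: 1 Mbit/s source, 500 ms RTT, 1000-byte segments.
v_slow = slow_start_rate(1e6, 0.5, 8000, p_loss=1 / 17)
v_closed = slow_start_rate_closed_form(1e6, 0.5, 8000, p_loss=1 / 17)
v_avoid = congestion_avoidance_rate(1e6, 0.5, 8000, p_loss=0.01)
```

The first two functions should agree, since the closed form is only an algebraic simplification; comparing them is a cheap consistency check before calibrating against measurements.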



In conclusion, it may be possible to use mathematical relations for calibrating the
transfer rate measurements so that the transfer rate can be calculated from the delay
and the loss ratio. For test traffic the other parameters (R, r) are either known or
can be measured. Such a relation will contain several cases where different formulae
apply. There should be a close connection between the transfer rate and the response
times that a user will experience, even though they are not quite the same concept.
However, measurement of the transfer rate using the given formulae has not yet been
tested in a real network, and there may be other factors that should be considered.
There are several variants of TCP and modes such as fast recovery. The results in
this section will hopefully encourage more research on the feasibility of such a
measurement.



6 Measurement System 

A measurement system for QoS monitoring is a system that gives a network operator a
method to monitor effectively, preferably without physically moving, that the
promised QoS is reached. The method should produce data on situations where QoS is
not reached and on situations where some improvement is needed in order to reach the
QoS targets. The measurement method should not overload the network, overload the
operator in investigating the data, or fill all storage space. A measurement can be
continuous or switched on and off for a certain time. A QoS monitoring method must
specify the following: what data is to be collected, how, where and when the data is
collected, how and when the data is gathered for processing, and where and how the
data is processed. Not all measurement methods allow similar data gathering methods.
If long time-stamped records have to be compared between two places, then the
gathering method is not SNMP or CMIP, but can be SMTP, FTP, or possibly HTTP. The
gathering method should support interactive requesting of data, which is much better
for QoS monitoring, since problems are transient and need a fast response. Network
management should therefore be used (so that no long records need to be correlated).
The method must not create too much information or too much processing, either for
the human manager or for active network elements. Additionally, it should be possible
to implement the method in a real IP network, and the technology should follow the
usual methods for network management of IP networks.

Figure 2 shows the proposed measurement system. Through a network management
interface (like SNMP RMON) the operator checks whether target values for QoS are
exceeded. The operator can query QoS/GOS information. If QoS/GOS target values are
exceeded, a notification is sent, e.g. every 15 minutes. Notifications should not
come too often, since at times of high load the QoS measurement method should not
produce additional high load. Notifications are filtered in order to decide whether
anything needs to be done. The measurement reference points have clocks synchronized
using NTP. They stamp some or all IP packets with the current time at one MRP and
read the time stamp at another MRP. Only cases where QoS target values are exceeded
are saved and result in a notification.
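
The behaviour described above (save only the cases that exceed the target, and notify at most every 15 minutes) might be sketched as follows. This is a simplified model, not the implemented system: the class name, the delay target and the timestamps in seconds are all assumptions of the example.

```python
from dataclasses import dataclass, field

NOTIFY_INTERVAL_S = 15 * 60  # at most one notification per 15 minutes


@dataclass
class DelayMonitor:
    """Compares time stamps written at one MRP and read at another MRP."""
    target_delay_s: float
    last_notification_s: float = float("-inf")
    saved: list = field(default_factory=list)

    def observe(self, stamped_at_s, read_at_s):
        """Record one packet; return True if a notification is raised."""
        delay = read_at_s - stamped_at_s
        if delay <= self.target_delay_s:
            return False  # target met: nothing is saved
        self.saved.append((read_at_s, delay))
        if read_at_s - self.last_notification_s >= NOTIFY_INTERVAL_S:
            self.last_notification_s = read_at_s
            return True
        return False  # exceeded, but a notification was sent recently


monitor = DelayMonitor(target_delay_s=0.1)
events = [
    monitor.observe(0.0, 0.05),       # within target
    monitor.observe(10.0, 10.5),      # exceeded, first notification
    monitor.observe(20.0, 20.5),      # exceeded, suppressed
    monitor.observe(1000.0, 1000.5),  # exceeded, 15 minutes have passed
]
```

Suppressing repeated notifications keeps the monitoring traffic low exactly when the network is already loaded, which is the reason given in the text for the 15-minute spacing.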

Figure 2. A measurement system



In Figure 2, for packets going to another provider's network (the Internet in Fig. 2),
the bit rate may be too high for inserting time stamps; therefore the packets are
only passively monitored. This means that incoming packets must already have the
additional field for the time stamp if QoS is to be monitored. Not all packets are
assumed to have time stamps. Another way of creating a viable measurement environment
is to make it specific to a certain network technique, as in [14]. There the
architecture of the network is built on Differentiated Services forwarding. The
network design also includes an integrated measurement collection and dissemination
architecture and a set of common operational practices for establishing inter-domain
reservations. In each node a basic set of QoS measurements is performed by a variable
combination of techniques, from active or passive measurement to SNMP polling of
network devices. Additionally, SNMP is not the only possible solution for exchanging
policy information. Another solution is COPS, which was originally created for
policing the RSVP protocol. An extension to COPS for the Differentiated Services
architecture is also proposed in [15].



7 Connection Set-Up Delay Measurement 

Set-up delays are included in the GOS-parameters. In the QoS-Internet, set-up delay
monitoring will be needed e.g. in MPLS, which will modify the existing ATM
considerably. In this section a measurement of set-up delays in ATM is explained.
This measurement mostly serves as an implemented example of how to monitor
QoS/GOS-parameters along the principles of Figure 2. In the measurement, RMON is used
for collecting measurement data, and RMON runs PERL scripts that measure the
QoS/GOS-parameters. The script calculates the average of the measured delays for
setup-connect and disconnect-release, then starts the SNMP agent, listens on the
appropriate SNMP port and performs the measurement. These delays, however, are not
directly implemented in ATM drivers or in ATM switches. Therefore the UNI 3.1
signalling timers described in [17] are used in the measurement, and the delays are
calculated from them. We thank Janne Oksanen of LUT for designing and implementing
the measurement system. Though made for ATM QoS measurements, extending implemented
MIBs by scripts suits IP equally well.



8 Conclusions 

The paper points out some simple but insufficiently appreciated problems in measuring
QoS using IPPM or I.380 methods. They are based on test traffic and do not accurately
measure QoS. In particular, measurement of loss was seen to be difficult if loss
probabilities are small. A method of counting discarded packets in routers was
proposed. The paper gives the principles of the QoS parameter definitions used in
[20]. This measurement system is currently being implemented, and measurements will
be available in the summer of 2000. [20] does not use SNMP for collecting measurement
information. This paper shares the experiences from collecting QoS information from
MIBs. This is a logical choice but has the practical problem that the MIBs do not
currently give enough QoS information. It might seem a trivial task to include it
there, but it is actually difficult, as the Managed Objects must be connected to a
real resource. We discovered that this can be done practically and within an
acceptable time using scripts. Such extension of MIBs enables fast experimentation
with SNMP-based QoS monitoring before QoS-monitoring MIBs are standardized and
implemented, as well as contributions to the MIB standardization. The relevance of
the E.500-series of recommendations was pointed out, together with a formula for the
fractile measures of E.492. Finally, a simple and mathematically sound way of
deriving a formula for the transfer rate of TCP was given.

Abstract. The essential aspect of the global evaluation of a service is
the opinion of the users of the service. The result of this evaluation
expresses the users' degrees of satisfaction. A typical user is not
concerned with how a particular service is provided, or with any of the
aspects of the network's internal design. However, he is interested in
comparing one service with another in terms of certain universal,
user-oriented performance concerns, which apply to any end-to-end service.
Comparable application-oriented evaluation methods and results are
needed urgently.

This paper outlines a generic quality evaluation methodology for 
multimedia applications like e-mail-based services, the World Wide 
Web (WWW), and real time applications. It complements today's 
quality frameworks outlined by international standardization bodies like 
ISO or ITU-T. 

The method introduced is applied to a videoconferencing service to
describe its principles and benefits. The paper is for end users and
(competing) service providers trying to get comparable evaluation
results by applying common and simple measurement methods and by
concentrating on a well-defined subset of quality characteristics.



1 Introduction: Quality Assessment and Measurement 

Quality is the totality of characteristics of an entity that bear on its ability to
satisfy stated and implied needs. It can be seen as "fitness for use", "fitness for
purpose", "customer satisfaction", or "conformance to the requirements" (E.800, [1]).
Within this paper, we concentrate on the last interpretation. Network Quality of
Service (QoS) is a measurement of the network behavior and the definition of the
characteristics and properties of specific network services. It has been discussed
intensively for years among networking experts.

Unfortunately, Network Quality of Service mechanisms or methods do not guarantee the
service quality that is perceived by an end user (problem #1). The situation is
complex (see figure 1). Many factors influence the quality perceived by an end user,
and many parties (individuals, organizations, providers) are involved and
responsible. Nevertheless, end users do not want to go into the technical details and
interdependencies of service provision; they want simple evaluation methods and,
finally, comparable results.

Service Level Management (SLM) is the set of activities required to measure and
manage the quality of the information services that service providers (e.g.
telecommunications carriers) deliver to customers and that internal infrastructure
departments deliver to end users. SLM represents a broader framework for developing
and executing Service Level Agreements (SLAs) [2].




Fig. 1. Quality Dependencies 

Barriers to effective Service Level Management are manifold. Besides difficulties
with defining and negotiating SLAs, there is the problem of measuring the relevant
values (problem #2). Often, Quality of Service measurement is done by service
providers or IS departments themselves, so there has to be trust in the results
achieved by them (problem #3).

Vendors of Network Management (NM) tools today offer two different approaches: the
integration of Service Level Management into existing products as modules (e.g.
Micromuse Netcool [3]), or SLM software separated from NM software to attain new
customers (e.g. Hewlett Packard with OpenView and Firehunter [4]).

This results in a centralized solution that allows network operators to custom-design
real-time views of enterprise-wide services, providing operators with summaries of
service availability and details of network-based events so that they can circumvent
service outages. Experienced staff are needed to operate and optimize these tools.
Again, end users and customers have very few possibilities to get the information
that interests them. Thus a general problem continues to exist: either there is no
implementation of an NM tool (problem #4), or the tools offer too much data
(problem #5).

In order to solve the problems mentioned so far, a new method for Quality of Service
evaluation has been defined and applied. This paper is structured as follows:
Section 2 introduces the new method. The application of the new method to a
videoconferencing service is described in section 3, where all quality-related
characteristics are identified and selected. In section 4, we give a short
introduction to agent technology as a basis for an implementation, before section 5
describes our perception of agent technology for Service Level Management. Section 6
concludes the paper with a self-assessment of our work and the progress achieved so
far.

2 A New Method for Quality of Service Evaluation

The service information provided to a customer is generally different from that used
by a service provider managing the service. It is less detailed, as the customer is
only interested in the information relating to his own application/service and is not
concerned with all the details of the service provision. Not all service
characteristics are quality- or decision-relevant for an end user. Some are
essential, others are non-essential.

To make a comprehensive assessment of a service, the decision-relevant elements have
to be identified and selected. A selection, or a reduction to the essentials, is
subjective, but it has the advantage that an end user participates from the
beginning. We therefore propose in table 1 an end-user-focused quality evaluation
method derived from software quality evaluation.



Table 1. A New Quality Evaluation Method

| Step | Action | Meaning |
|------|--------|---------|
| 1 | Identification | All quality characteristics (i.e. essential and non-essential) have to be identified. |
| 2 | Rating | From the list of all quality characteristics found, a rating has to be made on the essential quality characteristics. |
| 3 | Selection | The quality characteristics to be measured have to be selected. |
| 4 | Quantification | For each characteristic selected, a quantification/threshold has to be defined (parameter-value pair). |
| 5 | Measurement | The measurement has to be done on behalf of or initiated by an end user taking the thresholds into account. |
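
The five steps can be read as a small filtering pipeline. The sketch below is only one possible interpretation: the function name, the treatment of thresholds as upper limits, and the example characteristics and values are all assumptions made for illustration.

```python
def evaluate_service(identified, essential, measurable, thresholds, measure):
    """Run steps 2 to 5 on the characteristics identified in step 1.

    identified  names of all quality characteristics (step 1)
    essential   names an end user rates as essential (step 2)
    measurable  names that can be measured economically (step 3)
    thresholds  parameter-value pairs, read here as upper limits (step 4)
    measure     callable returning the measured value for a name (step 5)
    """
    rated = [name for name in identified if name in essential]
    selected = [name for name in rated if name in measurable]
    report = {}
    for name in selected:
        value = measure(name)
        report[name] = {"value": value, "ok": value <= thresholds[name]}
    return report


# Hypothetical example data, not taken from a real measurement.
identified = ["overall delay [ms]", "audio/video skew [ms]", "quality of documentation"]
essential = {"overall delay [ms]", "audio/video skew [ms]"}
measurable = {"overall delay [ms]", "audio/video skew [ms]"}
thresholds = {"overall delay [ms]": 400, "audio/video skew [ms]": 100}
samples = {"overall delay [ms]": 250, "audio/video skew [ms]": 120}

report = evaluate_service(identified, essential, measurable, thresholds, samples.get)
```

Keeping the rating and selection sets as inputs reflects that both steps depend on the end user, so two users can obtain different reports from the same list of identified characteristics.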



We now apply this method to a real-time application, represented by a
videoconferencing service, which is a much harder piece of work than in [5].



3 Evaluation of a Real-Time Application 

In the following sections, we restrict ourselves to videoconferencing as a real-time
service following ITU-T H.323 (i.e. LAN- or Internet-based). H.323 components are
Terminals, Gateways, Gatekeepers, Multipoint Controllers (MCs) and Multipoint
Control Units (MCUs) (see figure 2). As a central unit, an MCU has special
importance, as H.323 allows different types of conferences:

- Point-to-point (2 terminals) or multipoint (3 or more terminals, i.e. without MCU)
- Centralized multipoint (MCU), decentralized multipoint, and hybrid/mixed forms
  (audio/video)
- With other H-series, GSTN (General Switched Telephone Network) or ISDN terminals
- With audio content only
- With video and data content additionally

At GMD, we tried to classify available desktop videoconferencing products from
1996 on. The goal was to obtain a representative set of products for each quality
level and platform, from low-end PCs and Macs to high-end workstations, and from
cheap black-and-white QuickCam cameras to 155 Mbps ATM conferences ([6], [7]).




3.1 Step 1: Identification 

Let us assume for the following considerations we want to book a basic 
videoconferencing service, with 3 participants (i.e. multipoint with MCU), and with 
audio and video content, i.e. no data (no T.120). The communication should take 
place between H.323 and non-H.323 terminals. This means that the H.323 service to 
be measured has to provide at least one MCU and at least one gateway. 

In this first step, all essential and non-essential service quality characteristics for a 
videoconferencing service are identified. We take a technical specification document 
(H.323 [8]) and the standards mentioned herein as a starting point for our quality 
evaluation method. Unfortunately, we have to go into the details of the standards 
involved as videoconferencing is much more complex than e-mail or online 
services [5]. 

Today there exists one recommendation from ITU for desktop videoconferencing 
over all kinds of networks: H.200 [9]. This is a general framework, which includes a 
lot of other standards. Products never refer to H.200 itself, but to a recommendation 
which works on a specialized network, e.g. H.323. 

H.261 [10] describes the video coding and decoding methods for the moving picture
component of audio-visual services at rates of p x 64 kbit/s, where p is in the
range 1 to 30 (see table 2).

H.263 [11] specifies a coded representation that can be used for compressing the
moving picture component of audio-visual services at low bit rates. The source coder
can operate on five standardized picture formats: sub-QCIF, QCIF, CIF, 4CIF and
16CIF (see table 2). For H.323, only H.261 with QCIF format is mandatory.



Table 2. ITU Video Encoding for a H.323 Videoconference

| Standard | Characteristic | Subcharacteristic | Attribute Value |
|----------|----------------|-------------------|-----------------|
| H.261 | media | | analog and digital video |
| | formats | CIF | 352 pixels/line x 288 lines/image (luminance), 176 pixels/line x 144 lines/image (chrominance) |
| | | QCIF | 176 pixels/line x 144 lines/image (luminance), 88 pixels/line x 72 lines/image (chrominance) |
| | frame rate | | 29.97 frames/sec (Hz) |
| | bit rate | | 64 kbit/s to 2 Mbit/s |
| H.263 | media | | analog and digital video |
| | formats | SQCIF | 128 pixels/line x 96 lines/image (luminance), 64 pixels/line x 48 lines/image (chrominance) |
| | | QCIF | see H.261 |
| | | CIF | see H.261 |
| | | 4CIF | 704 pixels/line x 576 lines/image |
| | | 16CIF | 1408 pixels/line x 1152 lines/image |
| | frame rate | | up to 30 frames/sec (Hz) |
| | bit rate | | Variable |
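
To give a feeling for the size of the picture formats and for the p x 64 kbit/s rates of H.261, the following sketch computes the number of luminance samples per image and the nominal bit rate for a given p. The dictionary and function names are chosen for this example only.

```python
# Luminance resolution (pixels per line, lines per image) of each format.
LUMINANCE = {
    "SQCIF": (128, 96),
    "QCIF": (176, 144),
    "CIF": (352, 288),
    "4CIF": (704, 576),
    "16CIF": (1408, 1152),
}


def luminance_pixels(picture_format):
    """Number of luminance samples in one image of the given format."""
    width, height = LUMINANCE[picture_format]
    return width * height


def h261_bit_rate_kbit(p):
    """Nominal H.261 bit rate p x 64 kbit/s, with p in the range 1 to 30."""
    if not 1 <= p <= 30:
        raise ValueError("p must be in the range 1 to 30")
    return p * 64


cif_pixels = luminance_pixels("CIF")
qcif_pixels = luminance_pixels("QCIF")
max_rate_kbit = h261_bit_rate_kbit(30)
```

Each step from QCIF to CIF, 4CIF and 16CIF multiplies the number of samples by four, which is why the higher formats are only practical at higher bit rates.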



The audio encoding standards supported by H.323 are G.711, G.722, G.723, G.728, and
G.729 (see table 3). They define how audio is digitized and sent. G.711 is mandatory;
the other standards are optional. G.711 standardizes the digitization of speech to
support digital telephony over voice-grade cables. It is based on the PCM (Pulse
Code Modulation) digitizing algorithm using logarithmic coding.

The different ITU standards for the digitizing and compression of audio signals
reflect different relationships between audio quality, bit rate, computer power (e.g.
load on a 100 MHz processor), and signal delay (see table 3). Most of them are based
on the adaptive differential pulse code modulation technique (ADPCM). G.722 is a more
sophisticated standard which targets improved quality (7 kHz bandwidth instead of
3.4 kHz with other schemes) at 48 to 64 kbit/s. G.728 targets low bandwidth,
16 kbit/s only, but the resulting quality is inferior to that of the other
standards [12].

T.120 [13] is the umbrella for a series of ITU recommendations which provide a 
means for telecommunicating many forms of Data/Telematic information between 
two or many multimedia terminals and for managing such communication. 
Compliance with T.120 is an issue as such. Additionally, T.120 gives valuable hints 
for the thresholds for the Multipoint Communication Service (MCS) described in the 
recommendations T.122 and T.125 (see table 4). MCS domain parameter settings 
have to adhere to the specified ranges in all situations. 

Table 3. ITU Audio Encoding for a H.323 Videoconference

| | G.711 | G.722 | G.723 | G.728 | G.729 |
|---|-------|-------|-------|-------|-------|
| Title | Pulse Code Modulation (PCM) of voice frequencies | 7 kHz audio-coding within 64 kbit/s | Dual rate speech coder for multimedia communications transmitting at 5.3 and 6.3 kbit/s | Coding of speech at 16 kbit/s using low-delay code excited linear prediction | Coding of speech at 8 kbit/s using conjugate-structure algebraic-code-excited linear-prediction (CS-ACELP) |
| Sampling Rate [kHz] | 8 | 16 | 8 | 3.3 | 8 |
| Bit Rate [kbit/s] | 64 | 48, 56, or 64 | 5.3 or 6.3 | 16 | 8 |
| Audio Quality [kHz] | 3.4 | 0.050 to 7 | 4.3 | 3.4 | 4.3 |
| Computer Power [%] | 1 | | 40 | | 50 |
| Signal Delay [ms] | 0.25 | 4 | 37.5 | 0.625 | 15 |


Table 4. MCS Domain Parameters

| MCS Domain Parameter | Minimum Value | Maximum Value |
|----------------------|---------------|---------------|
| Channels in use simultaneously | 10 | 65 535 |
| User ids assigned simultaneously | 10 | 64 535 |
| Token ids grabbed/inhibited simultaneously | 0 | 65 535 |
| Data transfer priorities implemented | 1 | 4 |
| Maximum height | 2 | 100 |
| Maximum size of domain MCSPDUs | 128 octets | 4096 octets |
| Protocol version | Version 2 | Version 2 |
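
Since MCS domain parameter settings have to stay within the ranges of table 4, a range check is a natural first test. The sketch below is a minimal illustration; the parameter keys are names invented for this example, not identifiers from T.122 or T.125.

```python
# (minimum, maximum) per MCS domain parameter, taken from table 4.
MCS_RANGES = {
    "channels_in_use": (10, 65535),
    "user_ids_assigned": (10, 64535),
    "token_ids": (0, 65535),
    "data_transfer_priorities": (1, 4),
    "maximum_height": (2, 100),
    "max_mcspdu_size_octets": (128, 4096),
    "protocol_version": (2, 2),
}


def out_of_range(settings):
    """Return the sorted names of all settings outside their allowed range."""
    violations = []
    for name, value in settings.items():
        minimum, maximum = MCS_RANGES[name]
        if not minimum <= value <= maximum:
            violations.append(name)
    return sorted(violations)


# Hypothetical settings of a conference under test.
good_settings = {"channels_in_use": 34, "maximum_height": 2, "protocol_version": 2}
bad_settings = {"user_ids_assigned": 65000, "max_mcspdu_size_octets": 64}

good_violations = out_of_range(good_settings)
bad_violations = out_of_range(bad_settings)
```

Returning the names of the violated parameters, rather than a single pass/fail flag, keeps the result usable as a threshold report in the quantification step.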



Communications under H.323 can be considered as a mix of audio, video, data, and
control signals. Audio capabilities, Q.931 call signaling and setup, RAS control
(H.225.0), and H.245 signaling are required. All other capabilities, including video
and data conferencing, are optional.

Basically, H.225.0 describes how audio, video, data, and control information on a
non-guaranteed quality of service LAN can be managed to provide conversational
services in H.323 equipment [14]. The Registration/Admission/Status (RAS) signaling
function uses H.225.0 messages to perform registration, admissions, bandwidth
changes, status, and disconnect procedures between endpoints and Gatekeepers. H.245
specifies terminal information messages as well as procedures to use them for in-band
negotiation of the communication [15].

As the ITU F-Series in general describes service aspects of telecommunication, ITU-T
F.702 [16] defines a videoconferencing service, as opposed to the H-Series, which
describes technology. F.702 classifies videoconferencing into two main categories:
basic videoconferencing services and high-quality videoconferencing, the goal being
broadcast-like quality (see table 5). F.702 also addresses the issue of
interoperability.



Table 5. F.702 Quality Characteristics

| Quality Characteristic | Basic Videoconferencing Service | High Quality Videoconferencing Service |
|------------------------|---------------------------------|----------------------------------------|
| Picture Quality | Spatial resolution: CIF for limited movements | Television broadcasting quality (Recommendation ITU-R 601), high definition television quality (HD-TV) |
| Audio Quality | Wideband audio: 7 kHz (G.722 operating at 48 kbit/s) | Stereophonic audio: 7-15 kHz |
| Differential delay between audio and the video signals | lip-synchronism: difference between the playout of sound and the display of video should not exceed 100 ms | lip-synchronism |
| Overall delay (transmission delay + terminal/equipment delay) | < 400 ms (without video, audiographic) | < 400 ms (without video, audiographic) |
| Disturbances when switching between videosources or when changing the video channel bit rate | for further study | for further study |
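
The two numeric limits of the basic service in table 5 can be turned into a simple check. This is a sketch under the assumption that play-out and display times, as well as the two delay components, are available in milliseconds; the names are hypothetical.

```python
LIP_SYNC_LIMIT_MS = 100       # audio/video difference should not exceed 100 ms
OVERALL_DELAY_LIMIT_MS = 400  # overall delay should stay below 400 ms


def check_basic_service(audio_playout_ms, video_display_ms,
                        transmission_delay_ms, terminal_delay_ms):
    """Compare measured values with the basic service limits of table 5."""
    skew_ms = abs(audio_playout_ms - video_display_ms)
    overall_ms = transmission_delay_ms + terminal_delay_ms
    return {
        "lip_sync_ok": skew_ms <= LIP_SYNC_LIMIT_MS,
        "overall_delay_ok": overall_ms < OVERALL_DELAY_LIMIT_MS,
    }


# Hypothetical measurements of two sessions.
session_a = check_basic_service(1000, 1080, transmission_delay_ms=150, terminal_delay_ms=120)
session_b = check_basic_service(1000, 1150, transmission_delay_ms=300, terminal_delay_ms=100)
```

Note the different comparisons: the lip-synchronism limit is inclusive ("should not exceed"), while the overall delay limit is strict ("< 400 ms").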



3.2 Step 2: Rating 

Based on the output from the first step where all quality characteristics have been 
identified, now the essential service quality characteristics are separated from the non- 
essentials. This is done with participation of an end user. 

Some general principles in end user perception of a videoconferencing service apply.
The human eye cannot distinguish small differences in picture quality. The perception
of the audio signal is more important. A good audio codec is always a compromise
between audio quality and bit rate. Problems with the entities involved, such as
denial of service, jitter, and poor audio/video synchronization resulting from
network load, streaming problems, incompatible files, or interoperability issues,
have an additional major impact on an end user's perception. High data transfer rates
mean that digital video can be transported directly from the digital source into the
receiving computer with no processing delays. High transport speeds can, however,
result in latency (elapsed time) problems requiring significant buffering capacity at
the receiving computer.

The basic question to be answered before measuring a videoconferencing service is:
what are the entities/components to be measured? As pointed out in figure 1 (Quality
Dependencies), the local infrastructure quality is only one part of the game.

The UKERNA Videoconferencing Advisory Service (VCAS) undertook a number
of technical studies including the evaluation of videoconferencing equipment. The
main goal of this evaluation was to provide objective advice for Higher Education and
Research Institutions so that they may make informed choices when purchasing
videoconferencing equipment for use over ISDN networks [17].

Table 6. Rating of Videoconferencing Quality Characteristics

| Quality Category | Characteristic |
|------------------|----------------|
| Audio Quality (G.711) | Sampling rate [kHz]; Bit rate [kbit/s]; Audio Quality [kHz]; Signal delay [ms] |
| Video Quality (H.261) | QCIF format; Frame rate; Bit rate |
| Basic Service | Audio/video delay; Overall delay |
| Performance | Availability; Reliability |
| Interworking | Point-to-point compatibility; Compatibility through MCU |
| Ease of Use | Quality of documentation; Booking procedure; Session set-up procedure; Mode of operation; Problem and error handling |


Out of the broad set of standards involved in videoconferencing, the UKERNA
engineers identified a set of quality characteristics to be assessed and measured
as a prerequisite for participating in their videoconferencing service. Again, there is a
differentiation between subjective and objective characteristics for evaluating
products or services: technical considerations such as audio quality and vision
quality are paramount, and the ability to transfer data transparently is a desirable
feature. However, just as important for the user, especially a non-technical user, are
the ease of setting up and operating the equipment and dialing the remote site, and the
reliability of connections during a conference.

We took these results into account in addition to our own experience [6], [7]. The
specific quality characteristics rated are shown in table 6.



3.3 Step 3: Selection 

In this step, a selection has to take place in order to identify those essential quality
characteristics which can be measured in an automatic and economic way (i.e. are good
to measure). Once again, the expectations and requirements of an end user determine
the output of this third step.

The measurement of audio and video quality is based on a comparison of the input
signal (audio/video) at a sending videoconferencing participant's site with the output
signal (audio/video) at a receiving videoconferencing participant's site. The general
set-up is described in the next section. The methods used (lip synchronization,
block distortion, blurring, color errors, jerkiness etc.) go deep into the physics of
audio and video signaling and have to be worked out in detail before the
implementation.

In table 7, we selected those quality characteristics that turned out to be most
important in our trial [6], [7]. Naturally, as the evaluation method described is
flexible, an end user can specify a completely different list in this step.



Table 7. Quality Characteristics to be Measured

Quality Category         Characteristic        Attribute Value
-----------------------  --------------------  ------------------------------------------
Audio Quality (G.711)    Sampling rate [kHz]   16
                         Bit rate [kbit/s]     48, 56, or 64
                         Audio Quality [kHz]   0.050 to 7
                         Signal delay [ms]     4
Video Quality (H.261)    QCIF format           176 pixels/line x 144 lines/image
                                               (chrominance), 352 pixels/line x 288
                                               lines/image (luminance)
                         Frame rate            29.97 frames/sec (Hz)
                         Bit rate              64 kbit/s to 2 Mbit/s
Basic Service            Audio/video delay     Lip-synchronism/difference < 100 ms
                         Overall delay         < 400 ms (without video, audiographic)
Performance              Availability          24 hours/7 days/52 weeks or time interval
                                               booked
                         Reliability           Lost session per usage <= 3% (Pass/Fail)



After this step, we now have selected the quality characteristics to be measured. 



3.4 Step 4: Quantification

In this step, a parameter-value pair for each quality characteristic to be measured 
has to be defined. An end user brings in target values and thresholds. This step results 
in real numbers for the measurement. 

The quantification of the relevant quality characteristics is also described in table 7
(column 3). The attribute values are based on the definitions in the standards involved,
on product/service specifications available on the market [17], and on our own
experience in providing value-added services for years.
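
The parameter-value pairs of this step can be written down in a directly checkable
form. The following is a minimal Python sketch, not the agent implementation itself.
It encodes three of the thresholds from table 7 (column 3) and evaluates a set of
measured values against them. The dictionary keys, the Threshold class and the sample
measurements are hypothetical names and numbers chosen for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Threshold:
    """One parameter-value pair: a characteristic, its unit and a pass/fail rule."""
    characteristic: str
    unit: str
    passes: Callable[[float], bool]


# Target values follow table 7; the keys are illustrative (assumed) identifiers.
THRESHOLDS: Dict[str, Threshold] = {
    "lip_sync_ms": Threshold("Audio/video delay", "ms", lambda v: abs(v) < 100),
    "overall_delay_ms": Threshold("Overall delay", "ms", lambda v: v < 400),
    "lost_sessions_pct": Threshold("Reliability", "%", lambda v: v <= 3),
}


def evaluate(measurements: Dict[str, float]) -> Dict[str, bool]:
    """Return pass/fail per measured characteristic; unknown keys are rejected."""
    result = {}
    for key, value in measurements.items():
        if key not in THRESHOLDS:
            raise KeyError(f"no threshold defined for {key!r}")
        result[key] = THRESHOLDS[key].passes(value)
    return result


if __name__ == "__main__":
    # Hypothetical measurement results, not data from the trial.
    sample = {"lip_sync_ms": 80.0, "overall_delay_ms": 450.0, "lost_sessions_pct": 2.5}
    for name, ok in evaluate(sample).items():
        print(f"{name}: {'pass' if ok else 'fail'}")
```

Keeping the thresholds in one table-like structure mirrors the flexibility of the
method: an end user who specifies a different list in step 3 only changes the entries.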



3.5 Step 5: Measurement 

After the last step, we now have defined the boundary conditions/thresholds
(parameter-value pairs) for the measurement to be performed in this step. The
measurement itself provides results that cover end users' quality interests and
concerns, that address quality characteristics that can be directly observed and
measured at the point at which the service is accessed by the user [1], that focus on
major quality-related aspects, and that are reproducible, comprehensible, trustworthy,
and comparable.

In order to perform step 5, we want to use intelligent software agents again, as
in [5].



4 Quality Evaluation for a Videoconferencing Service 

Setting up quality evaluation for a videoconferencing service means that there
have to be at least three participating agents. To achieve relevant results, some
requirements regarding the location of the agents have to be met. We see three
possible locations for the agents that compare the input/output signals, measure the
delays, and verify availability and reliability:

1. located at the central units under test (e.g. MCU, Gateway, Gatekeeper) 

2. located on separate computers in the LAN of the videoconferencing service 
provider 

3. located on separate computers outside of the LAN of the videoconferencing 
service provider 

The first two possibilities are not practicable for end users because the resources
needed are normally not available to them. Besides this, there are severe security
problems which would have to be overcome. For providers themselves, however, these
options are applicable. For end users, option three is therefore the way to go (see
figure 3). To avoid any network influence on the results of the measurements (i.e.
network-independent measurement), the measuring agent has to be connected through a
QoS network (H.320, H.321, or H.322). The measurement by one agent is then
sufficient to evaluate the service because of the guaranteed network quality. No other
statistical methods are needed.




Figure 3. Videoconferencing Service Test Configuration

The software agents #1 and #2 have to be H.323 terminals in order to be able to
participate in a videoconference following H.323. Additionally, agent #1 has to act as
a streaming server for audio and video to produce the comparable signals needed. The
measurement agent (#3) needs protocol-independent features to measure
characteristics not covered by the videoconferencing standards themselves (e.g.
performance) and it has to have the H.320, H.321, or H.322 protocol stack
implemented (see figure 3).

Finally, it is worth mentioning that the agents will perform only good-natured
measurement. They will measure the audio/video streams as is, without any
manipulation, i.e. without injecting packet loss, intra delay jitter (within one data
stream) or inter delay jitter (between audio and video data) to test the robustness and
recoverability of the videoconferencing service software.



4.1 Agent Usage of Standardized Protocols 

For the measurement agent, the H.323 protocol already defines several quality- 
related protocol operations to be performed. 

RTCP for Network Performance Measurement 

Figure 1 shows the interdependence of end user perceived quality and network
quality. A given level of transport QoS results in a level of user-perceived
audio/video QoS that is a function in part of the effectiveness of the methods used
to overcome transport QoS problems (H.225.0, [14]). Although we are
application-oriented, we always wanted a simple method to check network
availability and performance easily. H.225.0 provides a means for the user of H.323
equipment to determine that quality problems are the result of LAN congestion
by usage of RTCP (Real-time Transport Control Protocol [18]).

The Sender and Receiver Reports of RTCP can be used for network QoS
measurement. There are two types of congestion which can be detected: a
short-term congestion resulting in a perceptible but not annoying slowing of the frame
rate, and a LAN congestion over time resulting in a general delay.
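
As an illustration of what a measurement agent can derive from a single RTCP
reception report, the following minimal Python sketch converts the "fraction lost"
field into a loss ratio and computes the round-trip time from the LSR (last SR
timestamp) and DLSR (delay since last SR) fields. The field semantics are those of the
RTP specification: fraction lost is an 8-bit fixed-point number, and LSR, DLSR and the
arrival time are expressed in units of 1/65536 seconds. The function names are
hypothetical, and the sample values are the worked example from the RTP specification,
not measurements from our trial.

```python
def fraction_lost(fraction_field: int) -> float:
    """Convert the 8-bit 'fraction lost' field to a loss ratio in [0, 1)."""
    if not 0 <= fraction_field <= 255:
        raise ValueError("fraction lost is an 8-bit field")
    return fraction_field / 256.0


def round_trip_seconds(arrival: int, lsr: int, dlsr: int) -> float:
    """Round-trip time seen by the sender of the original Sender Report.

    arrival: middle 32 bits of the NTP time at which the report was received
    lsr:     'last SR timestamp' field of the reception report
    dlsr:    'delay since last SR' field of the reception report
    All three are in units of 1/65536 s; the subtraction wraps modulo 2**32.
    """
    rtt_units = (arrival - lsr - dlsr) & 0xFFFFFFFF
    return rtt_units / 65536.0


if __name__ == "__main__":
    print(round_trip_seconds(0xB7108000, 0xB7052000, 0x00054000))  # seconds
    print(fraction_lost(64))  # ratio of packets lost since the previous report
```

A rising round-trip time over several report intervals would hint at the second type
of congestion described above, whereas isolated peaks in the loss ratio point to the
short-term type.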

Q.931 for Availability Measurement

H.225.0 does not contain quality-related characteristics, but it defines the usage of
messages like IRQ (Information Request) and IRR (Information Request Response)
[19] which can additionally be used for quality measurement on application level:
this information can be used by third party maintenance devices to monitor
H.323 conferences to verify system operation [8].

H.245 for Overall Delay Measurement 

Recommendation H.245 [15] provides commands and procedures for round trip
indications using RoundTripDelayRequest and RoundTripDelayResponse.



4.2 Input Data 

The American National Standards Institute (ANSI) provides a collection of video 
test scenes for subjective and objective performance assessment [20]. The scenes
represent limited examples of the content that may be found in video
teleconferencing/video telephony usage. Some test scenes have an accompanying
audio track. ANSI additionally specifies methods of measurement for video
performance parameters for the end-to-end transmission quality. However, these
methods are hard for end users to understand and apply. Nevertheless, the video test
scenes can be used as a well-accepted set of input signals for the quality evaluation.



4.3 Implementing Agent-Based Quality Evaluation for a Videoconferencing 
Service 

In [5] we applied our new quality evaluation method to an online service and
identified the essential, good-to-measure quality characteristics. The quality evaluation
was done by agents.

In our agent implementation we did not want to start from scratch. Fortunately,
several agent development platforms are available today. These tools already
meet some basic requirements for developing different agents. After an evaluation (e.g.
[21], [22], [23], [24]), the ZEUS Agent Building Toolkit from British Telecom [21]
was chosen since it offers a solution for most of the needs, has an open design
approach, and is extensible.

For the videoconferencing scenario, we found that no Java-based H.323 clients
are available. So we decided to implement them ourselves, again without
starting from scratch. Some of the H.323 protocol stacks/development platforms
available (e.g. by RADVision) are written in C, but we wanted to use a Java-based H.323
development platform for the same reasons for which we had chosen the ZEUS platform.
The related announcements at different places (i.e. product advertisement, discussion
lists, international working groups) sounded quite promising.

JAIN by Sun stands for Integrated Network APIs for the Java Platform [25]. The
JAIN initiative consists of two Expert Groups, the Protocols Expert Group (PEG)
and the Application Expert Group (AEG). A JAIN H.323 API is on the to-do list, but
the responsible Protocols Expert Group has not found a chairman since 1999.
Therefore we do not expect any results within the next six months.

Besides this, we looked at the Java Media Framework (JMF), which is an
application programming interface (API) for incorporating audio, video and other
time-based media into Java applications and applets [26]. JMF offers the ability to
capture, play back, transmit and transcode audio, video and other media. Several audio
and video formats are implemented, but there is no support for the system control
standards needed to implement an H.323 stack.

At present we are waiting for a Java-based H.323 development platform/API
which has been announced by IBM in the context of the alphaWorks programme [27].
It is expected to be available this summer. Based upon Sun's Java Telephony
Application Programming Interface (JTAPI), IBM will publish a Java API for the
H.323 protocol stack. This API contains a set of classes, interfaces, and principles of
operation for the realization of software interfaces to applications and services based
on the H.323 standard.



5 Conclusion

Quality evaluation is often discussed as a complex procedure for managing all 
potential attributes and topics. This type of quality management is not very effective 
because it does not help identify the decision-relevant criteria. This paper has 
demonstrated the feasibility of end user focussed QoS evaluation. With our approach, 
we have solved the problems #2-5 mentioned in the introduction. Problem #1, the 
difference between network and end user perceived (application) quality, has to be 
addressed by a specific information policy. 

As shown in the videoconference example, which we simplified for better 
understanding, one has to identify what has to be evaluated and the basic 
characteristics of that service. In our examples, the standard [8] is the major source 
for the characteristics and their derivations. The next step was to quantify the required 
characteristics and to instrument the measurement modules in terms of agents. 

We have outlined our implementation status so far. At the time of the QofIS'2000
workshop, we will be able to report on our progress and on the experience gained. The
advantage of the approach presented is that it yields an applicable method for QoS
evaluation. If the characteristics are carefully selected, one can achieve efficient
evaluations.



Abstract. In this paper, the concept of fairness as a future field of research
in computer networks is investigated. We motivate the need for examining
fairness issues by providing examples of future application scenarios in which
fairness support is needed in order to experience sufficient service quality. We
further demonstrate how fairness definitions from political science and in
computer networks are related and, finally, contribute to the ongoing research
activities by defining the fairness challenge with the purpose of helping direct
future investigations to the white spots on the map of research in fairness.



1 Introduction 

Fairness in computer networks deals with the distribution of network resources
among applications; i.e., fairness is achieved when network resources are
distributed in a fair way. Investigating fairness in computer networks aims at two
goals. The first goal is to improve the behaviour of networking architectures
by adding the valuable concept of distributing resources fairly, which should be
considered both for existing and for new scenarios. We call this concept
macro-fairness, because it deals with the distribution of the overall network resources.

The second goal is to enable new (fair) applications that are currently not 
implemented in existing networks for various reasons. We refer to this concept 
as micro-fairness. Micro-fairness aims at achieving a fair distribution of the net- 
work resources at a much finer granularity and is necessary to provide the needed 
service quality for certain applications. For example, with micro-fairness, two 
packets leaving a single source at some point in time for two different destinations 
might be required to reach their destinations at exactly the same moment. 

Macro-fairness has been studied to a large extent, whereas micro-fairness
still requires considerable further investigation. In the remaining paragraphs of this
introduction, we place micro-fairness in the hierarchy of needs of users in
computer networks and motivate it via new (fair) application scenarios that
require a certain level of fairness.

Current computer networks fulfill most needs of their users. Email, file
transfer, WWW access, IP-telephony, video-conferencing, etc. are widely deployed and
more or less well supported by most computer networks. Nevertheless, there
exist real-time application scenarios that are not yet implemented, examples of
which are tele stock trading (we think of intra-day stock trading from home),
large-scale distributed real-time games, real-time tele auctions, etc. We believe
that the main reason why these applications have not been deployed is that the
existing networking mechanisms are insufficient.

At an abstract level, the needs of users in computer networks can be struc- 
tured hierarchically as a pyramid (see Figure 1), which can be somewhat likened 
to Maslow’s pyramid of human needs [1]. 




The minimum level of users’ needs in a computer network is basic data de- 
livery functionality, which provides asynchronous off-line data delivery with no 
explicit requirements on the time or duration of delivery. At a service level, 
simple email service can be regarded as an example. 

The next higher level in the pyramid adds advanced data delivery function- 
ality: users want to be provided with synchronous, interactive, and/or two-way 
on-line data delivery. Examples of such services are file transfer and WWW. 
Note that at this level, there are still no hard bounds on the time, duration and 
delay of data delivery. 

The third level adds quality of service (QoS). Users want to perform real-time
multimedia communication involving data streaming for audio/video. Therefore, 
they need a communication system that offers sufficient quality of service for 
the traffic. Note that in this context, quality of service means abstract user 
requirements concerning the data delivery and does not necessarily mean that 
the communication system needs explicit QoS support: even in well-provisioned 
best-effort networks users might be content with the quality of service they re- 
ceive, while the network itself has no explicit mechanisms for providing QoS. 
Examples of communication that need a certain level of quality of service are
telephony, video-conferencing, tele-education, tele-medicine, and many more.

At the highest level of the pyramid, the technical feasibility for most applications
is already assured through the lower three levels. Nevertheless, there are
application scenarios where mechanisms and guarantees are needed that are beyond
current technology. Such application scenarios include real-time tele-stock
trading, tele-voting, large-scale distributed games and real-time electronic auctions.

For these applications, quality of service provision in a network alone does 
not suffice: for example, users want not only to be sure that the communication 
system provides the service with its required QoS guarantees, they also want a 
guarantee that they have at least the same opportunities as their competitors. 
Fairness is therefore one main aspect of the highest level of the pyramid. In 
addition to fairness, other needs, such as network availability, security, etc. are 
also located at the highest level. Note that especially due to the heterogeneity 
of current networks, the requirements of such applications cannot be fulfilled 
through QoS guarantees only. 

The lower three levels, which are necessary to make network services
functional, have been addressed in detail in the literature and still merit a lot
of further discussion (to which this paper will not contribute). Unfortunately,
high-level concepts to provide the actual service quality for many applications,
in particular fairness concepts in computer networks, have not been examined
as thoroughly. In particular, an overall view of fairness concepts seems to be
missing. This paper is intended to fill this gap and to shed some light on the
importance of fairness concepts for computer networks.

The remainder of the paper is organized as follows. Section 2 borrows
models from political science to give both a common sense definition and a for- 
mal description of the concept of maximizing welfare. The concept of maximized 
welfare in political science corresponds to the commonly used definition of fair- 
ness in computer networks. In Section 3, existing approaches to macro-fairness 
and new aspects that come with micro-fairness are presented. 

Section 4 concludes this work with an overview and discussion of open issues 
and challenges in fairness research for computer networks, thereby defining the 
fairness challenge with various facets that should be investigated within future 
research. 



2 What Is Fairness?

The concept of fairness has been studied in various scientific areas. Most thor- 
ough and theory-based approaches arose from the field of political science and 
political economics: fair allocations of consumption bundles in an economy have 
been investigated, and a common sense definition of a fair allocation is given as 
“an allocation where no person in the economy prefers anyone else’s consump- 
tion bundle over his own” [2], i.e., “a fair allocation is free of envy” [3]. Even this 
very general definition indicates the conceptual difficulty of fairness: in order to 
ensure fairness in a system, all system entities have to be satisfied with their
allocated share of the system’s goods. Therefore, the distribution mechanism has
to take into account the subjective preferences of the system entities.

The famous problem of how to divide a cake fairly into pieces for a number n 
of hungry and competing cake eaters has been examined in various early studies 
in the field of econometrics (see, e.g., [4,5] and [6]). The cake division problem
illustrates well the difficulty of the fairness concept for simple distribution
problems¹. It should be noted that any algorithm that solves the cake division
problem requires active participation of the cake eaters, i.e., they have to signal 
their preferences to make sure that they are content with the outcome. 

In computer networks, the situation is very similar: resources have to be dis- 
tributed among competing users of the computer network, which can be likened 
to distributing pieces of cake to competing cake eaters. The practical problem 
with applying the cake division algorithm to resources in computer networks is 
that it would require active signaling of the users’ preferences upon all changes 
of resource distribution in the network. For scalability reasons, this approach is 
clearly not feasible. 

In order to avoid this problem of continuous signaling, a concept to express
the preferences of a user has been developed: utility functions (see [7]). In order
to analyze and formalize computer networks, utility functions are defined for
networking applications as functions that map a service delivered by the
network² into the performance of the application for that service. Utility can be
considered a measure of how much a user would be willing to pay for the
service [8]. In Section 3, we will see that macro-fairness is related to individualistic
utility functions, while micro-fairness relates to another type of utility function,
which we call group-constrained utility.



2.1 Pareto-Efficiency and Welfare

The concepts of pareto-efficiency and welfare in political economics are strongly
related to the concept of fairness and will be briefly reviewed to provide a more
theoretical definition of fairness.

Let us follow [9] to define pareto-efficiency:

In general, an allocation $\omega$ of resource bundles $(x_1, \ldots, x_k)$ is feasible if the
excess demand $z(\omega)$ for that allocation is $\le 0$. The excess demand is the aggregate
vector of demands reduced by the aggregate vector of resources available; thus,
an allocation is feasible if the aggregate supply of resources exceeds or equals
the aggregate resource requirements of users.

A utility allocation $u_i$ represents user $i$'s utility of an allocation $\omega$ for a resource
bundle $(x_1, \ldots, x_k)$. A utility allocation $u^1$ is dominated by $u^2$ if $u^2$ is feasible
and $u^2 > u^1$, i.e., if $u^2$ is preferred to $u^1$.

¹ An algorithm that solves the cake division problem proposed by Banach and Knaster
(see [4]) is very simple: a knife is moved at constant speed over the cake and is poised
at each instant such that it could cut a unique slice of the cake. Thus, the potential
slice increases monotonically until it becomes the entire cake. The first person to
indicate satisfaction with the slice determined by the position of the knife receives
that slice (if two persons indicate satisfaction simultaneously, the slice is given to
any one of them). Then, the rest of the cake is distributed using the same constructive
method.
² The service describes all relevant measures, such as delay, throughput, loss rate, etc.

A utility allocation $u^1$ is pareto-efficient if it is feasible and not dominated by
any other feasible utility allocation $u^2$. In more general terms, “a situation is
pareto-efficient, if there is no way to make any person better off without hurting
anybody else” (see [7], Section 16.9).
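
The definition can be made concrete with a small computation. The following minimal
Python sketch filters a finite set of feasible utility allocations down to the
pareto-efficient ones. It assumes that "dominated" means component-wise: an allocation
dominates another one if no user is worse off and at least one user is strictly better
off. The function names and the candidate allocations are hypothetical.

```python
from typing import List, Sequence, Tuple

Allocation = Tuple[float, ...]


def dominates(u2: Sequence[float], u1: Sequence[float]) -> bool:
    """True if u2 makes nobody worse off than u1 and somebody strictly better off."""
    if len(u1) != len(u2):
        raise ValueError("allocations must cover the same users")
    no_one_worse = all(b >= a for a, b in zip(u1, u2))
    someone_better = any(b > a for a, b in zip(u1, u2))
    return no_one_worse and someone_better


def pareto_efficient(feasible: List[Allocation]) -> List[Allocation]:
    """Keep the allocations that no other feasible allocation dominates."""
    return [u for u in feasible if not any(dominates(v, u) for v in feasible)]


if __name__ == "__main__":
    # Hypothetical feasible utility allocations for two users.
    candidates = [(1, 1), (2, 1), (0, 3), (3, 0), (1, 2)]
    print(pareto_efficient(candidates))
```

In this example only (1, 1) is removed, since both (2, 1) and (1, 2) dominate it. The
allocation (3, 0), in which the second user receives nothing, survives the filter.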

Pareto-efficiency is clearly a desirable criterion of an allocation. Nevertheless,
it is only a weak criterion. The problem is that an allocation in which one
user gets everything can also be pareto-efficient, and such an allocation is certainly
not fair.

Welfare extends the concept of pareto-efficiency in a certain manner: the
basic problem of welfare (see, e.g., [10]) is to determine which of the feasible
allocations $u^j(x_1, \ldots, x_k)$ should be selected. For that reason, it is assumed that
there exists a general welfare function $W(u_1, u_2, \ldots, u_n)$ that aggregates the
individual utility functions $u_i$ of the users. A welfare function is required to be
increasing in all of its arguments.

It can be shown that any feasible allocation of maximum welfare must
necessarily be pareto-efficient³. For that reason, it seems very desirable to find
an appropriate welfare function and perform the maximization in order to obtain
maximum welfare while being pareto-efficient. The problem with this approach
is the welfare function itself, since it is not clear how to perfectly aggregate
individual preferences.

2.2 Examples of Welfare Functions 

For different purposes, different examples of welfare functions exist, each corre- 
sponding to a different criterion of welfare. For an introductory overview and 
comparison of different criteria of welfare see [12]. 

One criterion is the maximin criterion, which corresponds to the Rawlsian
welfare function $W(u_1, \ldots, u_n) = \min(u_1, \ldots, u_n)$. The maximin criterion weighs
only the utility of the worst-off user.

The sum of utilities criterion corresponds to the classical utilitarian
welfare function $W(u_1, \ldots, u_n) = \sum_{i=1}^{n} u_i$. In contrast to the maximin
criterion, this criterion weighs the utility of each user equally.

Both these criteria have certain problems: the maximin criterion does not 
weigh improvements of those who are not least well off; and the sum of utilities 
criterion might prefer a situation where some users are very happy and others 
are very miserable, rather than allowing an allocation where all users are “just
happy”, i.e., in between extremely happy and very miserable.
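
The difference between the two limiting criteria shows up even with two users. The
following minimal Python sketch applies both welfare functions to two hypothetical
utility allocations; the allocation names and values are assumptions chosen for
illustration.

```python
def rawlsian(utilities):
    """Maximin criterion: welfare is the utility of the worst-off user."""
    return min(utilities)


def utilitarian(utilities):
    """Sum-of-utilities criterion: every user's utility counts equally."""
    return sum(utilities)


def preferred(welfare, allocations):
    """Return the name of the allocation with the highest welfare."""
    return max(allocations, key=lambda name: welfare(allocations[name]))


# Hypothetical utility allocations for two users.
ALLOCATIONS = {
    "unequal": (9.0, 1.0),   # one very happy user, one very miserable user
    "balanced": (4.0, 4.0),  # both users "just happy"
}

if __name__ == "__main__":
    print("utilitarian prefers:", preferred(utilitarian, ALLOCATIONS))
    print("rawlsian prefers:", preferred(rawlsian, ALLOCATIONS))
```

The sum of utilities selects the unequal allocation (10 versus 8), while the maximin
criterion selects the balanced one (4 versus 1).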

These two criteria can be regarded as the limiting cases. In between, there
exists a whole range of compromise welfare functions, all aiming at different
goals. One example is the weighted-sum-of-utilities welfare function
$W(u_1, \ldots, u_n) = \sum_{i=1}^{n} a_i u_i$, where $a_i$ is a weight assigned to $u_i$,
thereby expressing individual priorities between different users. Another example is
the sum-of-square-roots function $W(u_1, \ldots, u_n) = \sum_{i=1}^{n} \sqrt{u_i}$, where
users with smaller utilities are given higher relative priorities.

³ For a simple proof, see, e.g., [11].

Yet another very interesting welfare function is the sum-of-logs function
$W(u_1, \ldots, u_n) = \sum_{i=1}^{n} \log(u_i)$, which corresponds to the so-called Nash
criterion. Note that the Nash criterion maximizes the product of additional utilities
compared to the status quo. It was first described by Nash [13] as the solution
to the bargaining game in game theory. This maximized welfare function has the
property that its outcome is not affected by any linear transformation of a user’s
utility scale: if a user’s utility function is transformed using a positive linear
transformation, the solution to maximizing the welfare function yields an
allocation which is identical to the allocation before transformation. Therefore, this
type of welfare function is independent of changing the scales of the individual
utility functions, and inter-user comparisons of utility are not required, which is
an interesting property, since the transferability of utility remains questionable.
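
The scale independence of the Nash criterion can be checked numerically. The following
minimal Python sketch splits one unit of a resource between two users and searches a
grid for the split that maximizes a given welfare function. The square-root utility
functions, the grid resolution and the scaling factor of 10 are assumptions made for
illustration only.

```python
import math

# Candidate shares x for user 1; user 2 receives 1 - x.
GRID = [i / 1000 for i in range(1, 1000)]


def sum_of_logs(utilities):
    """Welfare function corresponding to the Nash criterion."""
    return sum(math.log(u) for u in utilities)


def sum_of_utilities(utilities):
    """Classical utilitarian welfare function."""
    return sum(utilities)


def best_share(welfare, scale1=1.0):
    """Share of user 1 that maximizes welfare when user 1's utility is rescaled."""
    def utilities(x):
        # Assumed utility functions: square root of the received share.
        return (scale1 * math.sqrt(x), math.sqrt(1.0 - x))
    return max(GRID, key=lambda x: welfare(utilities(x)))


if __name__ == "__main__":
    for name, w in (("sum of logs", sum_of_logs), ("sum of utilities", sum_of_utilities)):
        print(name, best_share(w), best_share(w, scale1=10.0))
```

Under the sum-of-logs function the rescaling only adds a constant to the welfare, so
the selected split stays at one half. Under the sum of utilities the split moves
strongly towards the user whose utility scale was enlarged.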

3 Fairness Concepts in Computer Networks 

In current computer networks, the term fairness corresponds to the concept of 
maximum welfare as defined in the previous section. The following subsections 
present the most common fairness definitions using the terminology presented 
in Section 2, give an overview of existing concepts, mechanisms and open ques- 
tions in computer networks regarding macro-fairness, and introduce the new 
challenges entailed by micro-fairness. 

3.1 Example Fairness Criteria

In the following two paragraphs, we briefly demonstrate how the most common 
fairness criteria in packet-based communication networks, maxmin fairness and 
proportional fairness, can be defined using the concepts of maximum welfare 
presented in Section 2. 



Maxmin Fairness The most popular fairness concept in computer networks,
maxmin-fairness [14], corresponds to the Rawlsian welfare function
$W(u_1, \ldots, u_n) = \min(u_1, \ldots, u_n)$ with the individualistic utility functions
$u_i(x_1, \ldots, x_n) = x_i, \forall i \in \{1, \ldots, n\}$, i.e., maxmin-fairness yields a
solution $x^* = (x^*_1, \ldots, x^*_n)$ for $\max(\min(x_1, \ldots, x_n))$. A maxmin-fair
situation has the property that for all $i$, $x^*_i$ cannot be increased without
simultaneously decreasing $x^*_j$ for some $j$ with $x^*_j \le x^*_i$.⁴



Proportional Fairness Another interesting fairness criterion is proportional
fairness (see [16])⁵. A proportionally fair allocation is the solution to the welfare
maximization problem with the welfare function
$W(u_1, \ldots, u_n) = \sum_{i=1}^{n} \log(u_i)$ and individualistic utility functions
$u_i(x_1, \ldots, x_n) = x_i$.

⁴ For a discussion of maxmin-fairness, see, e.g., [15].

⁵ Frank Kelly provided quite a substantial amount of work on proportional fairness,
which can be found at http://www.statslab.cam.ac.uk/~frank.
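
The two criteria lead to different allocations as soon as flows traverse different
numbers of links. The following minimal Python sketch uses an assumed toy topology that
is not taken from the cited work: two links of capacity 1 in series, one long flow that
crosses both links, and one short flow on each link. If both links are fully used, the
short flows each receive what the long flow leaves over, so the search is
one-dimensional.

```python
import math


def sum_of_logs(rates):
    """Welfare function of proportional fairness with utilities u_i = x_i."""
    return sum(math.log(r) for r in rates)


def best_long_flow_rate(welfare, steps=300):
    """Rate of the long flow that maximizes the given welfare function.

    Assumed topology: two unit-capacity links in series. The long flow uses
    both links at rate x0; each short flow uses one link at rate 1 - x0.
    """
    candidates = [i / steps for i in range(1, steps)]
    return max(candidates, key=lambda x0: welfare((x0, 1.0 - x0, 1.0 - x0)))


if __name__ == "__main__":
    print("proportional fairness:", best_long_flow_rate(sum_of_logs))
    print("maxmin fairness:", best_long_flow_rate(min))
```

Maxmin fairness gives all three flows the rate 1/2. Proportional fairness gives the
long flow only 1/3, because it consumes capacity on two links, and each short flow 2/3.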

It has been demonstrated that additive increase and multiplicative decrease 
end-to-end congestion control, assuming best effort FIFO queueing with tail 
dropping inside the network, tends to lead under certain circumstances to pro- 
portional fairness (see [17,18]). Note, however, that this does not necessarily 
hold for real Internet scenarios with TCP congestion control (see [19] and [20]). 
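
The tendency of additive increase and multiplicative decrease towards a fair share can
be observed in a strongly idealized setting. The following minimal Python sketch
assumes a single bottleneck link, two flows, and synchronous binary feedback: in every
round, all flows decrease if the link is overloaded and increase otherwise. These
assumptions, the parameter values and the function name are ours. The sketch does not
model TCP or real Internet behaviour.

```python
from typing import List, Sequence


def aimd(rates: Sequence[float], capacity: float, alpha: float = 1.0,
         beta: float = 0.5, rounds: int = 500) -> List[float]:
    """Simulate synchronized AIMD flows sharing one link of the given capacity."""
    x = list(rates)
    for _ in range(rounds):
        if sum(x) > capacity:
            x = [beta * r for r in x]    # congestion signal: multiplicative decrease
        else:
            x = [r + alpha for r in x]   # no congestion: additive increase
    return x


if __name__ == "__main__":
    # Hypothetical start: one flow far ahead of the other.
    print(aimd([1.0, 40.0], capacity=50.0))
```

The additive phase leaves the difference between the two rates unchanged, whereas
every multiplicative decrease halves it, so the rates converge to equal shares. On a
single bottleneck, equal shares are both the maxmin-fair and the proportionally fair
allocation.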

3.2 Current Fairness Concepts and Mechanisms: Macro-Fairness 

Within current computer networks, macro-fairness concepts are applied both to 
medium access control and to data transport. Concerning fair medium access 
control (MAC), mechanisms have been investigated for shared physical network 
links such that on average each sender or receiver gets a fair share of the available 
bandwidth. This issue is of concern both for medium access control in LAN 
environments, and, more recently, for mobile networks (see, e.g. [21]) and for
all-optical networks (see, e.g., [22]). Many problems related to MAC-layer fairness,
such as fair MAC-layer sharing of a common channel under error
conditions, are non-trivial and still require further research.

As for data transport, fairness concepts are relevant for both elastic and real- 
time traffic [8]. For both types of traffic there exist two approaches to provide 
fairness: one is providing fairness by defining appropriate cell/packet schedul- 
ing and queue management algorithms on networking nodes, whereas the other 
one is to achieve fairness by end-to-end congestion control mechanisms. When
comparing the two, it can be noted that end-to-end mechanisms result only in
statistical, on-average fairness, whereas queueing and scheduling mechanisms
allow for more precise control of fair rate allocations and have a shorter
response time when adjusting to new network load situations.

Note that the fairness issues addressed so far in computer networking
research mostly concern the problem of fair bandwidth distribution
among competing flows. Fair delay management, fair loss rate distribution, and
fair jitter control have hardly been addressed at any level of abstraction, which,
in our opinion, is insufficient for future applications with fairness requirements.



Queue Management Mechanisms for Fairness Providing fairness through 
queue management and scheduling mechanisms is an approach that has a high 
impact on the network’s architecture, since the fairness algorithms are imple- 
mented on switches or routers. But when supported, it can provide the most 
efficient, flexible and exact mechanism for fairness. 

For example, in the ATM TM 4.1 specification [23], various bandwidth related 
fairness criteria for the ABR service are defined. 

For the datagram network case, fairness definitions can be implemented at 
packet schedulers on routers. For example, there exists a whole range of fair 
queueing algorithms. For some early examples, see [24] and [25].



The approach of using fair queueing to provide fairness is, for instance, taken 
by the user-share differentiation (USD) scheme [26], which is a proposal for 
differentiated services [27] that ensures that the bandwidth allocated to traffic 
from a user is in proportion to the user’s share negotiated with the service 
provider. Implementations of schemes like USD use extended versions of fair 
queueing algorithms like weighted fair queueing [25] or variations of it (e.g., 
worst-case fair weighted fair queueing [28], self-clocked fair queueing [29], deficit
round robin [30]). 

Although these queueing algorithms lead to a fairer bandwidth distribution
among competing flows, not all of which are necessarily well-behaved, they have the
disadvantage of operating on a per-flow or per-user basis, the scalability, robustness
and feasibility of which in high-speed networks are still questionable. This
is because fair queueing algorithms have been designed for congestion control
and are usually stateful, as opposed to stateless congestion control algorithms
such as random early detection (RED) [31] and its variations.

Other DiffServ approaches to achieve fairness without requiring state at the 
core nodes include [32] and [33,34]. 

End-to-end Fairness Mechanisms Existing end-to-end fairness mechanisms 
are usually implemented by end-to-end congestion control schemes. 

The problem with end-to-end fairness mechanisms is that they
normally only work in a cooperative environment, i.e., if all flows competing for
network resources are well-behaved. In the Internet, well-behaved means tcp-friendly,
i.e., not sending at a higher data rate than a comparable TCP
flow in the same congestion situation*. Still, it is very questionable whether
tcp-friendliness is a valid assumption in the real world: besides TCP traffic, UDP
traffic exists in current IP networks and is, for instance, used for real-time flows.
In practice, the rate control algorithms of UDP applications are not always
tcp-friendly and therefore harm the overall fairness. In addition, there is the risk of malicious
TCP implementations that are deliberately not tcp-friendly in order to increase
their individual throughput at the cost of regular TCP flows. Currently, no
restrictive control mechanisms are implemented that punish flows that are
not tcp-friendly, unless this punishment is done by queue management on network
nodes as described above, which implicitly influences the type of fairness.
One possible approach to cope with this problem is to identify tcp-unfriendly
flows at the routers and punish them with appropriate dropping policies. For an
interesting discussion of solutions to the tcp-unfriendliness problem, see [32].

Multicast Concerning network level fairness, multicast packet or cell delivery
introduces an additional level of complexity. Following [36], we distinguish between
inter-fairness and intra-fairness. Inter-fairness means that multicast flows
should exhibit fair behaviour compared to other, unicast flows. Intra-fairness relates
to fairness inside the multicast scenario, e.g., different multicast sessions
among the same group of senders and receivers should exhibit fairness. We follow
the authors of [37] in pointing out that it is still not entirely clear how inter-fairness
among congestion-controlled multicast and TCP traffic should be defined: should
a multicast session to n receivers get the same share as one TCP connection or
as n TCP connections?

* For a detailed discussion on tcp-friendly applications and protocols see [35].

Another problem with multicast feedback control in general is the loss path
multiplicity (LPM) problem [38], i.e., a packet can be lost on any of the end-to-end
paths in the multicast tree. If the sending rate is controlled by loss indications
from all receivers, the sender will reduce its sending rate further and further
as the number of such paths increases, until it eventually
might cease sending. In [38], it has been shown that with such a scheme of
controlling the sending rate, maxmin fair sharing of bandwidth between unicast
and multicast traffic is impossible to achieve due to the LPM problem.
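The effect of loss path multiplicity can be illustrated with a back-of-the-envelope calculation. The sketch below assumes, purely for illustration, that every end-to-end path loses a packet independently with the same probability; in a real multicast tree paths share links, so losses are correlated and the numbers would differ. The function name is hypothetical.

```python
def prob_any_loss(p_path: float, n_paths: int) -> float:
    """Probability that a packet is reported lost by at least one of n receivers,
    assuming independent losses with probability p_path on each path."""
    return 1.0 - (1.0 - p_path) ** n_paths


# With a 1 % loss rate per path, the sender sees loss indications for a rapidly
# growing share of its packets as the group grows.
for n in (1, 10, 100, 1000):
    print(n, round(prob_any_loss(0.01, n), 3))
```

A sender that reacts to every such indication as a unicast TCP sender would therefore backs off far more often than any single receiver's path justifies.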

For multicasting real-time traffic as generated by audio/video applications,
layered multicast (see, e.g., [39,40] or [41]) is a very interesting mechanism to
use network resources effectively in a scenario with heterogeneous receivers. Still,
fairness issues for layered multicast have only been investigated at a very basic
level and open a whole new field of future research. In addition, layered multicast
adds another level of complexity to the fairness problem if the different
multicast layers operate at different sending rates*.

3.3 Fair Applications: Micro-Fairness 

All existing concepts, mechanisms and examples of macro-fairness are defined
by individualistic utility functions (see Section 3.1), meaning functions of type
$u_i(x_1, \dots, x_n) = f(x_i)$, i.e., the utility of a user i depends only on the resource
bundle he/she receives ($x_i$), but not on any other resource bundle $x_j$ with $j \neq i$.

We believe that especially for highly competitive applications, maximizing 
welfare with individualistic utility functions cannot correctly represent the se- 
mantics of the desired fairness. 

For that reason, we would like to present an example fairness definition for
micro-fairness using what we call group-constrained utility functions, i.e., utility
functions of type $u_i(x_1, \dots, x_n) = f(x_1, \dots, x_n)$, where f depends on at least one
resource bundle $x_j$ with $j \neq i$, and where $x_1, \dots, x_n$ are the resource bundles
received by the individual communication group members.

The example, which we have called group-delay constrained utility, can be
applied to real-time trading or real-time games scenarios, i.e., competitive
communication scenarios, where the semantics of the application require each
participant of a communication group to perceive at most the same average delay
as the other users, in order to be able to compete in the long term.

For the definition of this group-delay constrained utility, we take the following
individualistic utility function for delay:
$u_i(\text{delay}_{\text{user } 1}, \dots, \text{delay}_{\text{user } n}) = f(\text{delay}_{\text{user } i})$.
In order to represent this competition, we extend the utility function
with group constraints that represent the dependency of a user's utility on
the delay received by the other group members: for all users $i \in \{1, \dots, n\}$ of a
communication group

$$
u_i(\text{delay}_{\text{user } 1}, \dots, \text{delay}_{\text{user } n}) =
\begin{cases}
0 & \text{if } \exists j \in \{1, \dots, n\} : \text{avg delay}_{\text{user } i} > \text{avg delay}_{\text{user } j} \\
f(\text{delay}_{\text{user } i}) & \text{otherwise}
\end{cases}
$$

where f is the individualistic utility function described above.

* A very thorough approach to define and examine multi-rate multicast (inter- and
intra-) maxmin fairness has been provided in [42].

The maxmin fair solution using this new type of utility function leads to strict
equality concerning the average delay:
$\forall i, j : \text{avg delay}_{\text{user } i} = \text{avg delay}_{\text{user } j}$,
i.e., we have strict (and identical) upper and lower bounds for all users of that
communication group. The extension we have introduced represents the strong
effect of the competition inside the communication group: users consider
only an exactly equal situation with respect to the received average delay to be fair.

Note that in this example, QoS mechanisms for strict delay bounds could 
be used to achieve the fair resource distribution, once the value of the delay 
bound is determined according to the fairness definition using this group-delay 
constrained utility function. 
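The group-delay constrained utility can be written down directly as a small function. The sketch below is only an illustration: the concrete choice of f (here 1 / (1 + delay)), the use of integer millisecond samples, and all names are assumptions that do not appear in the text.

```python
from statistics import mean
from typing import Callable, Sequence


def individual_utility(avg_delay_ms: float) -> float:
    """Hypothetical individualistic utility f: the lower the delay, the higher the utility."""
    return 1.0 / (1.0 + avg_delay_ms)


def group_delay_constrained_utility(
    i: int,
    delays_ms: Sequence[Sequence[int]],
    f: Callable[[float], float] = individual_utility,
) -> float:
    """Utility of user i given the delay samples of every group member."""
    averages = [mean(samples) for samples in delays_ms]
    # The utility collapses to 0 as soon as another member has a strictly lower average delay.
    if any(averages[i] > other for other in averages):
        return 0.0
    return f(averages[i])


delays_ms = [[20, 40], [30, 30], [50, 50]]  # per-user delay samples in milliseconds
utilities = [group_delay_constrained_utility(i, delays_ms) for i in range(len(delays_ms))]
print(utilities)
```

Users 0 and 1 share the lowest average delay and keep their individual utility, while user 2 is worse off than someone else in the group and receives utility 0. With measured floating-point delays a small tolerance in the comparison would be needed.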

In client-server application scenarios, the abstract parameter on which the 
utility depends is the parameter response time, which encompasses the two-way 
transmission delay and some processing delay. In that case, a common mecha- 
nism for approximating micro-fairness is synchronization via transaction control. 
For instance, in auction scenarios, synchronization mechanisms are necessary for 
a fair treatment of the participants: all bidders want to get at least the same
chance to bid during a certain time slot. In such a scenario, a fair mechanism
is to collect (and acknowledge) the bids of all participants in a first step, then
to evaluate the synchronized bids and to announce the resulting highest bid as
input to the next round. Obviously, such auctioning mechanisms are neither very
time efficient nor can they provide exact fairness.
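The round-based synchronization described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the dictionary of offers and the alphabetical tie-break are hypothetical, and acknowledgement messages and timing are left out.

```python
from typing import Dict, Optional, Tuple


def run_round(offers: Dict[str, float], current_high: float) -> Tuple[Optional[str], float]:
    """Evaluate all offers collected during one time slot together.

    Arrival order inside the slot is ignored, so a participant with a
    shorter network delay gains no advantage within the round.
    """
    valid = {who: amount for who, amount in offers.items() if amount > current_high}
    if not valid:
        return None, current_high  # nobody raised; the previous value stands
    # Ties are broken alphabetically here, which is an arbitrary simplification.
    winner = max(sorted(valid), key=lambda who: valid[who])
    return winner, valid[winner]


print(run_round({"alice": 10.0, "bob": 12.0}, current_high=9.0))
print(run_round({"alice": 5.0}, current_high=9.0))
```

The cost of this scheme is visible in the structure: every round must wait for the slowest participant or for the end of the slot, which is why it is not time efficient.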

We leave it as a remaining challenge for future research in fairness to provide 
more efficient mechanisms for micro-fairness. 

4 Conclusion 

We hope to have demonstrated that even though specific fairness issues concern- 
ing computer networks have already been investigated to some extent, there is a 
vast amount of interesting and challenging work left to be done. We would like 
to motivate the reader to participate in further investigation of the wide range 
of interesting and challenging open topics in the field of fairness in computer 
networks by directing him/her to the extensions of fairness we believe to be 
most important for future research: extension of definition, extension to other 
QoS parameters and extension to new applications.

4.1 Extension of Definition

Currently, fairness is mainly defined for unicast cell or packet delivery. Other 
types of delivery, such as multicast, broadcast or anycast [43] require an exten- 
sion of the fairness definition. Examples for multicast include the aspects of 
inter-fairness and intra-fairness for best-effort multicast, congestion-controlled 
multicast, reliable multicast, and layered multi-rate multicast. Also, solutions 
to the LPM problem for different fairness criteria and multicast scenarios with 
multiple senders and dynamically joining and leaving receivers are at an early
stage and worth further investigation. In all of these topics, research has
just begun.

4.2 Extension to Other QoS Parameters 

The extension to other QoS parameters means to not only apply fairness concepts 
to bandwidth distribution problems, but also consider the fair management and 
control of loss rate, delay and delay jitter. We believe that especially fair delay 
management and fair jitter control have to be considered for future applications 
that require fairness as part of service quality. 

4.3 Extension to New Applications: Micro-Fairness 

Extension to new applications deals with the aspect of micro-fairness: the current
macro-fairness concept in computer networks has to be extended up to the
application level, i.e., fair applications should be supported. We believe that 
such application-semantic fairness is best supported if the communication chan- 
nel provides the necessary degree of fairness. An integral and comprehensive 
approach for fairness provisioning, especially based on non-individualistic utility 
functions, is needed and available as a new field for future research in fairness.

Abstract. Mechanisms for achieving fair bandwidth sharing have been
a hot research topic for many years. Protection of well-behaved flows 
and possible simplification of end-to-end congestion control 
mechanisms are the ultimate challenges driving the need for research in 
this area. Rate-based schedulers, such as weighted fair queuing, require 
per-flow state information in the routers and in addition a mechanism to 
determine which packets to drop under congestion. Therefore they are 
too complex to be implemented in high-speed networks. To address this
issue, many other schemes have been proposed; among them, core
stateless fair queuing (CSFQ) [1] constitutes the most revolutionary
approach. In this paper we propose an edge-marking scheme that
achieves fair bandwidth allocation by marking packets belonging to the 
same flow with different colors, i.e. layers, according to a token bucket 
scheme. The interior routers implement a Random Early Detection 
(RED) [12] based buffer acceptance mechanism that is able to drop 
packets based on their color. This buffer acceptance mechanism 
estimates the layer that is causing the congestion and drops packets 
according to a RED drop function. Using simulations we prove that this 
mechanism is stable and achieves a fair distribution of the bottleneck 
bandwidth. 

Keywords. Fair bandwidth sharing, differentiated services, congestion 
control 



1 Introduction and Statement of the Problem 

Today's Internet only provides Best-Effort Service. Traffic is processed as quickly as
possible, but there is no guarantee as to timeliness or actual delivery. With the rapid
transformation of the Internet into a commercial infrastructure, demands for different
levels of service have rapidly developed. It is becoming increasingly evident that
several service classes are necessary, each with its own fields of application.

So far, the Internet Engineering Task Force (IETF) has proposed many service 
models and mechanisms to meet the demand for Quality of Service (QoS). Among
them are the Integrated Services/RSVP model [2, 3] and the Differentiated Services
model [4, 5]. 

The Integrated Services model is characterized by resource reservation. For real- 
time applications, before data are transmitted, the application must first set up a path 
and reserve resources. This is the mission of the Resource ReSerVation Protocol 
(RSVP). 

Although Integrated Services is attractive from an application point of view, there 
are some issues preventing its wide deployment. 1) The amount of state information 
increases proportionally to the number of flows. This places a huge storage and 
processing overhead on the routers making this architecture not scalable for the 
Internet core. 2) Ubiquitous deployment is required since all routers along the path 
must support IntServ to set up a connection. 

Differentiated Services, on the other hand, re-uses the Type of Service (ToS) byte
of the IP header [4] to indicate the QoS requirements of the packet. Differentiated
Services is significantly different from Integrated Services. First of all, the service is
allocated at the granularity of a class, so the amount of state information is
proportional to the number of classes and not to the number of flows, as is the case
for IntServ. Therefore Differentiated Services provides a scalable QoS solution for ISP
networks. Secondly, sophisticated classification, marking, policing and shaping
operations are pushed to the boundary. Interior routers need only implement
Behavior Aggregate (BA) classification, i.e. on the basis of the DS field in the packet,
and the appropriate scheduling and buffer acceptance algorithms. A third advantage
is that not all routers necessarily have to implement DiffServ, which makes
incremental deployment possible.

Differentiated services do not define services as such, but rather Per Hop 
Behaviors (PHBs). These are intended to allow Internet service providers complete 
freedom to construct, from PHBs, the intra-domain services meeting their customers'
needs. In this paper we define a fair bandwidth allocation service, to fairly distribute
the bandwidth between all active flows. There are many ways to achieve this type of 
service, but three main types of mechanisms can be distinguished: 1) per-flow 
queuing mechanisms [6, 7], where each flow is queued in its own FIFO queue. 2) Per- 
flow accounting and dropping mechanisms, e.g. FRED [8], which maintains per-flow 
information and drops packets based on this information in times of congestion. 3) 
Mechanisms that achieve fair bandwidth allocation, without keeping per-flow 
information in each router [1, 10]. The latter ones are useful in high-speed backbone 
networks where the per-flow state information would result in scalability problems. 

In this paper we focus on the third type of mechanisms where we follow a similar 
methodology as in [10], but by re-defining the buffer acceptance mechanism to obtain 
more stable queues in the core routers. We achieve fair bandwidth allocation by 
maintaining per-flow information in the edge routers and running a suitable 
congestion control mechanism to drop packets in the core routers. The simulation 
results in this paper prove that this scheme is able to achieve fair bandwidth 
allocation, and that the mechanism can achieve approximately the same or better 
results as Core Stateless Fair Queueing (CSFQ) [1] and Rainbow Fair Queueing 
(RFQ) [10].

2 Mechanisms Needed for Fair Bandwidth Allocation

In this section we describe the marking scheme configured at the edge of the network, 
i.e. where the different flows are colored, as well as the buffer acceptance algorithm 
needed in the interior routers to achieve fair bandwidth allocation. 



2.1 Marking Scheme 

We use multiple token buckets in order to obtain multiple drop precedences for
coloring the packets of a certain flow, based on the arrival rate. These monitor a
packet stream and mark each packet with one of these colors. Packets are marked with N
different colors starting from 0 (the lowest layer) up to N - 1 (the highest layer). Each
drop precedence is supplied with its own token bucket having an Insertion Rate (IR_i)
and a Burst Size (BS_i) as parameters. One could set the IRs differently for each color,
starting from a small value, i.e. a thin layer, and increasing the thickness of the layers, as
in [10]. In this paper we will not consider this case, and we will use the same width
for each layer. Based on simulations and analyses, we use the following heuristic for
setting the IRs for a certain flow entitled to w percent of spare bandwidth.

$$ IR = w \cdot \frac{P}{N}, $$

where P is the maximum flow rate in the network and N the number of colors used. 

All token buckets are initially full. Thereafter the contents of each bucket are
increased by one token IR times per second, as long as the bucket is not full. When a
packet arrives, the buckets are checked in increasing order, i.e. first the lowest drop
precedence, then the second, etc. As soon as a bucket is encountered with enough
tokens available to accommodate the packet, that amount is removed and the packet is
marked with the corresponding color. All packets that exceed the contents of the
second-to-last bucket are colored with the last color (N - 1). We denote this marker by
multi-rate Multi Color Marker (mr-MCM); it is a generalization of the tr-TCM [11] in
which the number of colors equals N.
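The marking procedure can be sketched in a few lines. The code below is an illustration under several assumptions that the text does not fix: tokens are counted in bytes, buckets are refilled continuously from the elapsed time instead of in discrete one-token steps, w is given as a fraction, and only the first N - 1 colors have a bucket that is checked, with color N - 1 as the fallback. The class and parameter names are hypothetical.

```python
class MultiColorMarker:
    """Sketch of a multi-rate multi-color marker with equal layer widths."""

    def __init__(self, weight: float, peak_rate: float, n_colors: int, burst_size: float):
        self.n_colors = n_colors
        self.rate = weight * peak_rate / n_colors  # IR heuristic, in bytes per second
        self.burst = burst_size
        # One bucket per color 0 .. N-2; all buckets start full.
        self.tokens = [float(burst_size)] * (n_colors - 1)
        self.last_update = 0.0

    def mark(self, now: float, packet_size: int) -> int:
        """Return the color assigned to a packet of packet_size bytes arriving at time now."""
        elapsed = now - self.last_update
        self.last_update = now
        for c in range(len(self.tokens)):
            self.tokens[c] = min(self.burst, self.tokens[c] + elapsed * self.rate)
        # Check the buckets from the lowest drop precedence upwards.
        for c, available in enumerate(self.tokens):
            if available >= packet_size:
                self.tokens[c] -= packet_size
                return c
        return self.n_colors - 1  # traffic beyond all buckets gets the last color


marker = MultiColorMarker(weight=1.0, peak_rate=3000.0, n_colors=3, burst_size=1500.0)
colors = [marker.mark(0.0, 1500), marker.mark(0.0, 1500), marker.mark(0.0, 1500)]
print(colors, marker.mark(1.5, 1500))
```

A back-to-back burst climbs through the colors, and after the buckets have had time to refill the next packet is marked with the lowest color again.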



2.2 Buffer Acceptance Algorithms 

In this subsection RED and n-RED will be explained. Because n-RED has some
disadvantages when a large number of colors is used, we define a new
mechanism, which we denote by MC-RED.

RED and Its Extension to n-RED 

In RED [12] there are four configurable parameters. There are two buffer thresholds,
min_th and max_th, between which probabilistic packet drops occur, max_p that dictates
the drop probability at max_th, and finally a low-pass filter weight w_q, used in
calculating the average queue length, i.e. $avg \leftarrow (1 - w_q) \cdot avg + w_q \cdot q$, where q is
the instantaneous queue size. RED is an active queue management scheme, which
tries to keep the overall throughput high while maintaining a small average queue
length, and tolerates transient congestion without causing global synchronization of
TCP connections. The targets of control in a RED queue are the average queue
lengths through min_th, max_th and max_p, and the degree of burstiness reflected on the
average queue length through w_q. Except for the low-pass filter weight, a RED queue
configuration can be represented as in Fig. 1. When the average queue occupancy is
below the minimum threshold, no packets are dropped. When the average queue size
exceeds the minimum threshold, packets are dropped with an increasing probability.
When the average queue size exceeds the maximum threshold all arriving packets are
dropped. In [13] an improvement is discussed: a further linear continuation of
probabilistic dropping up to 1, until the avg. queue occupancy reaches 2 x max_th.




Fig. 1. Drop probability in function of the average queue length in RED 
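The drop function of Fig. 1 and the low-pass filter can be sketched as follows. This is a simplified illustration: it omits the packet-count correction and idle-time handling of the original RED algorithm as well as the continuation proposed in [13], and the function names are hypothetical.

```python
import random
from typing import Callable


def update_average(avg: float, q: float, w_q: float) -> float:
    """Exponentially weighted moving average of the instantaneous queue size q."""
    return (1.0 - w_q) * avg + w_q * q


def red_drop_probability(avg: float, min_th: float, max_th: float, max_p: float) -> float:
    """Piecewise-linear drop probability as a function of the average queue size."""
    if avg < min_th:
        return 0.0  # no drops below the minimum threshold
    if avg > max_th:
        return 1.0  # drop everything above the maximum threshold
    return max_p * (avg - min_th) / (max_th - min_th)


def red_accept(avg: float, min_th: float, max_th: float, max_p: float,
               rng: Callable[[], float] = random.random) -> bool:
    """Return True if the arriving packet is accepted into the queue."""
    return rng() >= red_drop_probability(avg, min_th, max_th, max_p)


for avg in (100, 350, 500, 600):
    print(avg, red_drop_probability(avg, min_th=200, max_th=500, max_p=0.1))
```

A small w_q makes the average react slowly, so short bursts pass without drops while persistent congestion eventually pushes the average above min_th.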

In extending RED for service differentiation (known as Weighted RED or n-RED),
the three parameters that directly control the average queue lengths are used. The idea
is to have n sets of {min_th, max_th, max_p} for n drop precedences indicated by the DS
field, where one or more parameter values differ between the classes. Eventually, the
differences in parameter values result in different drop probabilities for a given
average queue length. If we applied the n-RED algorithm in our fair
bandwidth allocation approach, we would have to set the thresholds as in Fig. 2, i.e.
minimum threshold of color x $\geq$ maximum threshold of color x + 1, since we would have to
discard all packets of layer x + 1 before starting to drop packets from layer x. An
example for n = 3 with average queue length L is given in Fig. 2. In this case all
packets with drop precedence 0 would be accepted, while packets with drop
precedence 1 would be dropped probabilistically and all packets from layer 2 would be
discarded.
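The staggered threshold configuration can be made concrete with a short sketch. It assumes, for illustration only, that the queue is split into n equally wide bands and that max_p is 1.0 for every color; the text only requires that the minimum threshold of color x is not below the maximum threshold of color x + 1. All names are hypothetical.

```python
from typing import Dict, Tuple


def staggered_thresholds(n_colors: int, queue_limit: float) -> Dict[int, Tuple[float, float]]:
    """Give each color its own (min_th, max_th) band; the highest color gets the lowest band."""
    band = queue_limit / n_colors
    thresholds = {}
    for color in range(n_colors):
        low = (n_colors - 1 - color) * band
        thresholds[color] = (low, low + band)
    return thresholds


def nred_drop_probability(color: int, avg: float,
                          thresholds: Dict[int, Tuple[float, float]],
                          max_p: float = 1.0) -> float:
    """RED drop function evaluated with the parameter set of the packet's color."""
    min_th, max_th = thresholds[color]
    if avg < min_th:
        return 0.0
    if avg > max_th:
        return 1.0
    return max_p * (avg - min_th) / (max_th - min_th)


th = staggered_thresholds(n_colors=3, queue_limit=300.0)
L = 150.0  # average queue length inside the band of color 1
print([nred_drop_probability(c, L, th) for c in range(3)])
```

For this L, color 0 is always accepted, color 1 is dropped probabilistically and color 2 is always discarded, which is the situation described for n = 3. The number of thresholds to maintain grows with every added color.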
Fig. 2. Drop probabilities for multiple classes in n-RED

Multi-color Random Early Detection (MC-RED) 

The number of parameters for n-RED increases linearly with the number
of colors, i.e. 3 more are added for each color. This makes the implementation and the
setting of the parameters for each class extremely complex. In order to keep
the advantages of discriminating packets with a higher drop precedence from those
with a lower one, and the simplicity of RED, we propose the MC-RED buffer
acceptance mechanism. This mechanism operates in the same way as RED does, but
has only one extra parameter that varies in time, i.e. it adjusts itself to find a stable
operation point. We have called this extra parameter the Drop Color (DC). Every
packet with a color below DC will be accepted, and every packet with a color above DC
will be dropped deterministically. When a packet with color DC arrives and the avg.
queue occupancy is between the minimum and the maximum threshold, the packet is
dropped probabilistically, i.e. RED applies the congestion avoidance algorithm to the
packets with color DC.

How do we calculate the DC value? 

The value of DC determines the color that will be dropped probabilistically. Because
the algorithm has to search for this value, a change in the DC value should
not make the drop probability jump up and down; it should change smoothly.
This is achieved by appropriately configuring the RED max_p value (see below) and by
carefully re-adjusting the avg. queue size when changing the DC value. The exact
behavior is described below.

When the average queue size exceeds the maximum threshold, the DC value is 
decreased and the average queue size is set equal to the minimum threshold. The idea 
is that in this case we are not dropping enough packets to reach a stable operation 
point, and thus more packets must be dropped, but only in increasing order of their 
drop precedence (discriminating packets with a higher drop precedence with respect 
to packets with a lower one). On the other hand, when the average queue occupancy 
drops beneath the minimum threshold we increase the DC value and set the avg. 
queue size equal to the maximum threshold. The idea is opposite to the above one, i.e.
we drop too harshly, so we should lower the number of drops.

How do we set the parameters for the overall RED mechanism?

The parameters for the minimum and maximum threshold, as well as the weight in the
low-pass filter to calculate the average queue occupancy, can be set in the same way
as for RED. Since the algorithm searches for the optimal drop probability (stable
operation point), a smooth increase to 1.0 (covering as many drop probabilities as
possible) is best. This means that we set max_p to 1.0. As already briefly mentioned,
we could choose a different max_p value and take the approach discussed in [13]. This
would be a sensible choice when the DC value equals the lowest color. The choice of
this setting enables us to change the average queue size when the DC value is changed
(see below).

Including a slope measure. 

To speed up the convergence of the algorithm to the correct DC value and to prevent
instability after a re-scaling action of the average queue length, the calculation of the
average queue size isn't done on every packet arrival. We consider two cases in which the
average queue size doesn't have to be updated:

Case 1. As the average queue size is calculated using a low-pass filter, it
allows for bursts in the network. However, when congestion
persists, the instantaneous queue size will keep on growing, and this will be reflected
by an average queue size that grows beyond the maximum threshold. From that moment
on the DC value will change and the average queue size will be adjusted. It will take some
time for the increased number of drops to be reflected in the instantaneous queue
occupancy and consequently in the average. If this is the correct DC value, the
instantaneous queue size will decrease and finally the average will remain stable between
the minimum and maximum threshold. Thus, from the moment we see that the
instantaneous queue size is falling (slope < 0), we do not update the average queue
size as long as the instantaneous queue size is above the maximum threshold. This behavior is
illustrated in Fig. 3.

Case 2. Similar to case 1, but when the instantaneous is below the minimum 
threshold and the slope of the instantaneous queue occupancy > 0. This behavior is 
illustrated in Fig. 4. 

An important property of a buffer acceptance algorithm is that it is able to detect 
and respond to long-term congestion, by dropping packets while handling short-term 
congestion by queuing packets. This implies that a smoothing or filtering function is 
used that monitors the instantaneous congestion level and computes a smoothed 
congestion level. In addition it must be insensitive to short-term traffic characteristics 
of micro-flows, therefore some randomness must be present in the dropping function. 
A last, but important property, is that the congestion indication feedback must be 
gradual rather than abrupt, to allow the overall system to reach a stable operation 
point. 

In this paper we take an approach where we alter RED [12] to make it
multiple-color aware, so that it is able to discriminate packets with a higher drop precedence
from packets with a lower one, while maintaining the simplicity of the original
mechanism. We thereby ensure that we conform to the requirements stated
above. In [10] a different buffer acceptance algorithm is explained. It is important to
note that the major difference between the two is the probabilistic dropping
behavior incorporated in our buffer acceptance algorithm, which also implies that a
certain portion of a layer is discarded instead of whole layers. This is important for
TCP traffic as it ensures that a high throughput is achieved and global
synchronization is avoided.



Fig. 3. Instantaneous above max and decreasing after a DC + 1 adjustment
(plot of queue size in packets: instantaneous and average queue size; the average is not updated)

Fig. 4. Instantaneous below min and growing after a DC - 1 adjustment
(plot of queue size in packets: instantaneous and average queue size)



Summarizing the algorithm in pseudo code we get:

DC = max_color;
for (each packet arriving at the queue) {
    update the avg. queue size if necessary (Fig. 3 and Fig. 4);
    if (the avg. queue size has been updated) {
        if (avg. queue size > maximum threshold) {
            DC = max(DC - 1, 0);
            avg. queue size = minimum threshold;
        }
        else if (avg. queue size < minimum threshold) {
            DC = min(DC + 1, max_color);
            avg. queue size = maximum threshold;
        }
    }
    if (color(packet) == DC)
        drop the packet probabilistically, as in RED;
    else if (color(packet) < DC)
        accept the packet;
    else if (color(packet) > DC)
        discard the packet;
}

where color(packet) returns the color of the packet.
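The pseudo code, together with the two slope cases, can be turned into a small runnable sketch. The class below follows the pseudo code step by step, but several details are assumptions made for illustration: the slope is approximated by the difference between consecutive queue-size samples, the drop probability is clamped to [0, 1], the random source is injectable for testing, and all names are hypothetical.

```python
import random
from typing import Callable


class McRed:
    """Sketch of the MC-RED buffer acceptance decision for one queue."""

    def __init__(self, min_th: float, max_th: float, w_q: float, max_color: int,
                 max_p: float = 1.0, rng: Callable[[], float] = random.random):
        self.min_th, self.max_th = min_th, max_th
        self.w_q, self.max_p = w_q, max_p
        self.max_color = max_color
        self.dc = max_color  # Drop Color starts at the highest color
        self.avg = 0.0
        self.prev_q = 0.0
        self.rng = rng

    def _maybe_update_average(self, q: float) -> bool:
        slope = q - self.prev_q
        self.prev_q = q
        if q > self.max_th and slope < 0:  # case 1: above max but already falling
            return False
        if q < self.min_th and slope > 0:  # case 2: below min but already growing
            return False
        self.avg = (1.0 - self.w_q) * self.avg + self.w_q * q
        return True

    def accept(self, color: int, q: float) -> bool:
        """Return True if a packet of this color is accepted at queue size q."""
        if self._maybe_update_average(q):
            if self.avg > self.max_th:
                self.dc = max(self.dc - 1, 0)
                self.avg = self.min_th
            elif self.avg < self.min_th:
                self.dc = min(self.dc + 1, self.max_color)
                self.avg = self.max_th
        if color < self.dc:
            return True
        if color > self.dc:
            return False
        p = self.max_p * (self.avg - self.min_th) / (self.max_th - self.min_th)
        return self.rng() >= min(1.0, max(0.0, p))


queue = McRed(min_th=200, max_th=500, w_q=0.5, max_color=9, rng=lambda: 0.5)
print(queue.accept(0, 1000), queue.accept(0, 1200), queue.dc, queue.avg)
```

As in the pseudo code, the average is re-scaled even when DC is already at 0 or max_color; whether the simulator used for the results does the same is not stated. The large w_q in the usage line only serves to make the DC change visible after two packets.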

3 Simulation Model and Results



In this section, we present the simulation model and results. In our simulations we 
consider a scenario known as the generic fairness configuration (GFC), which is 
depicted in Fig 5. In this scenario we have 10 users connected to an ISP, called Aa, 
Ab, Ba, Bb, Ca, Cb, Xa, Xb, Ya, and Yb. All users are equipped with a mr-MCM that 
marks the traffic stream with 10 different colors (0 .. 9). The IRs for the bandwidth 
differentiation for the different users are shown in the figure. The shaded region in the 
figure represents the ISP to which all users are connected. All access links, i.e. the 
links from the routers of the users towards the ISP, are 34 Mbps with a fixed delay of 
2.5 ms. The bandwidths in the domain of the ISP are shown in Fig 5. and they have a 
fixed delay of 10 ms. They are chosen such that the sources Ba, Bb, Xa and Xb have their
bottleneck in the first router; the sources Aa, Ab, Ya and Yb have their bottleneck in the
third router, whereas the Ca and Cb sources are bottlenecked in the second router.

Every router is equipped with MC-RED. The total queue size is equal to 800 x 
1500 bytes, the minimum threshold is set at 200 x 1500 bytes and the maximum at 
500 x 1500 bytes. The maximum drop probability is set to 1.0 and a weight of 0.002
is used for the low-pass filter used for calculating the average queue size. The arrows 
in the figure represent the direction in which the data packets flow, the 
acknowledgements flow in the opposite direction. 
Fig. 5. Simulation scenario

Our simulation lasts 105 seconds; during the first 5 seconds no simulation results are
gathered, to reduce start-up effects. We consider a mix of UDP and TCP sources. The
sources Ba, Xa, Aa, Ca, Ya consist of 20 TCP sources sending packets with an MSS of
1460 bytes. The other sources are UDP sources sending at a rate of 30 Mbps;
all UDP sources, except the sources Bb and Ab, start sending from the beginning. The
latter two only start sending data packets after 50 seconds, so that the load at the
routers changes suddenly. The buffer acceptance algorithm should be able to adapt to
guarantee a fair throughput at all times, without compromising the performance. We
compare these results with the results of CSFQ and RFQ.





Fig. 6. Color thresholds and queue size in the first bottleneck router. Panels:
(a) MC-RED DC value, (b) MC-RED queue size (bits), (c) RFQ DC value, (d) RFQ queue
size (bits)



In Figs. 6 and 7 we plot the DC value as well as the queue occupancy as a function of
time for MC-RED and RFQ. From these figures it is clear that MC-RED has an extra
advantage over RFQ. As stated in [10], the queue behavior of RFQ is unpredictable and
can show huge oscillations. MC-RED, on the other hand, eliminates this because it is
able to discard only a portion of a layer, i.e. the layer causing the congestion (DC
remains stable instead of oscillating). We also see that the MC-RED mechanism is able
to adjust its DC value when the UDP sources become active during the second part of
the simulation and that it rather quickly reaches a stable operating point. This is not
the case for RFQ, which oscillates between the layer causing the congestion, the
underlying layer and the layer above. By doing this, the RFQ mechanism sometimes
drops too many packets and compensates for the loss by accepting too many packets
afterwards, never reaching a stable operating point. This is why the DC value for RFQ
fluctuates so much. Only when the arrival rate of colors, i.e. the sum of the arrival
rates of a certain portion of layers, equals the link rate is the reduction satisfactory.
In this case RFQ empties the queue and forces a sending rate equal to the link rate
(Fig. 7). Although RFQ achieves a fair distribution, it is not the best way of dropping
colored packets. As future layered applications might color their packets to indicate
their importance, the dropping of packets belonging to a lower layer should be
minimized. The reason is that a packet with color x + 1 might be useless without the
corresponding packet belonging to the underlying layer, i.e. colored with color x. An
example is layered video over the Assured Forwarding (AF) PHB.
Fig. 7. Color thresholds and queue size in the second bottleneck router



In Fig. 8 the expected throughput is calculated as follows.

Since in the beginning the UDP sources Bb and Ab are not active, the bandwidth of
bottleneck router 1 is shared between Ba, Xa and Xb, which get 3/7, 3/7 and 1/7
of the bandwidth, respectively. After 50 seconds of simulation time the Bb source
becomes active and the bottleneck bandwidth in router 1 is shared according to 3/8,
1/8, 3/8 and 1/8 for Ba, Bb, Xa and Xb, respectively. This means, for example, that on
average the expected throughput for Ba equals 0.5 x (3/7 + 3/8) x 30 Mbps = 12.05
Mbps. For sources Aa, Ab, Ya and Yb a similar reasoning is used, but with a bottleneck
of 25 Mbps. For sources Ca and Cb we first calculate the spare
bandwidth in router 2 and divide this bandwidth according to the weights. Since Ba
and Bb are bottlenecked in the first router and Aa and Ab are bottlenecked in
the third router, the spare bandwidth in router 2 is equal to 50 Mbps minus the
expected throughput of Aa, Ab, Ba and Bb.
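The expected-throughput arithmetic for router 1 can be reproduced with a short Python sketch. The function name `expected_throughput` is hypothetical, and the integer weights (3, 3, 1, and 1 for the late-starting Bb) are inferred from the shares quoted in the text rather than stated by the authors; the equal weighting of the two halves follows the factor 0.5 used there.

```python
from fractions import Fraction


def expected_throughput(weight, weights_before, weights_after, link_mbps):
    """Average of the weighted share before and after the late source starts.

    Hypothetical helper; the two phases are weighted equally, matching the
    factor 0.5 used in the text.
    """
    share_before = Fraction(weight, sum(weights_before))
    share_after = Fraction(weight, sum(weights_after))
    return float((share_before + share_after) / 2 * link_mbps)


# Router 1 (30 Mbps): Ba, Xa, Xb with assumed weights 3, 3, 1; Bb (weight 1) joins later.
ba = expected_throughput(3, [3, 3, 1], [3, 1, 3, 1], 30)
xb = expected_throughput(1, [3, 3, 1], [3, 1, 3, 1], 30)
print(round(ba, 2), round(xb, 2))  # 12.05 4.02
```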
Fig. 8. Bandwidth differentiation for each aggregated flow

From these simulations we clearly see that our scheme is able to provide fair 
bandwidth allocation close to that of core stateless fair queuing and better than RFQ, 
which has an unstable queue behavior. 



4 Conclusions 

In this paper we presented a scheme for achieving approximate fair bandwidth sharing
without per-flow state. The mechanism achieves fair sharing by marking packets of the
same flow according to a token bucket scheme at the edge of the network. Within the
core routers a color-aware buffer acceptance algorithm (MC-RED) is implemented.
This buffer acceptance mechanism is based on RED [12], enhanced with a color
threshold DC that aims at estimating the layer causing the congestion. All packets
marked with a color smaller than this value are accepted, whereas all packets above
this color are dropped. For the packets with color DC, RED is deployed. Because of the
structure of the layers the discarding is approximately fair. The advantage of this
scheme with respect to CSFQ is that the implementation complexity is lower and that it
does not require an extra IP option. With 8 bits we can already support 256 colors.
The RFQ mechanism is also able to provide fair bandwidth allocation, but it does so by
accepting too many packets during some parts of the simulation and correcting this by
accepting too few during other parts. The two errors cancel each other out, which
yields the fair distribution. The MC-RED mechanism corrects this error by introducing
a probabilistic dropping function, which is able to avoid global synchronization
and under-utilization of the bottleneck routers.
Abstract. One building block to provide service quality is queueing
disciplines such as Random Early Detection (RED). Besides the topic
of service differentiation (e.g., Differentiated Services), data flows within
a service class are expected to receive the same service quality, which is
expressed in the service quality metric ‘fairness’. This document evaluates
traffic phase effects when using RED gateways in conjunction with
UDP-based constant bit rate (CBR) sources. These effects can lead to an unfair
division of the available bandwidth among CBR data flows with the same
bandwidth share. It is shown that the introduction of randomization may
help to improve the fairness.



1 Introduction 

The Internet Engineering Task Force has recommended active queue management
algorithms such as Random Early Detection (RED) [3] for usage in IP
gateways to achieve congestion avoidance while keeping a high utilization of the
underlying link [1]. RED gateways control the average queue size by dropping (or
marking) single packets “early”, i.e., before the maximum queue size is reached.
This way, traffic sources are informed about the rising level of congestion. This
congestion avoidance method relies on the ability of the transport protocol to
interpret a packet drop as a sign of congestion so that it reduces the transmission
rate accordingly, as TCP does. Further objectives of RED are the avoidance
of global synchronization to keep a high link utilization and the avoidance
of a bias against bursty traffic (cf. [7, 9] for an evaluation of these objectives).
Additionally, RED gateways are designed to be able to maintain an upper bound
of the average queue size even when the transport protocol on the traffic source
does not reduce the transmission rate on congestion notification. An example
for such a congestion insensitive transport protocol is UDP.

* This work has been supported partially by a grant from the German Ministry for
Research and Education (BMB+F) within the project “Communication and Mobility
by Cellular Advanced Radio” (COMCAR). Within this project, we are investigating
high quality IP traffic to and from vehicles in the wide area. The work described
here belongs to our studies on how to improve quality of service in a mobile wireless
environment.

Since the amount of UDP-based traffic in the Internet is assumed to grow 
in the future, the last mentioned objective of the RED design becomes more 
important. When an RED gateway is configured, it may not be known in advance 
if UDP or TCP traffic is dominating. 

Furthermore, many users of the Internet are attached to the Internet via
circuit-switched lines with a limited maximum bit rate (e.g., based on a modem).
To accommodate these users, many coders exist for audio [4, 6] or video [5] which
generate constant bit rate traffic.

This document evaluates the usage of RED gateways in conjunction with
UDP-based constant bit rate (CBR) traffic flows. It focuses on fairness issues
between UDP-only traffic.



2 Random Early Detection Gateways 

2.1 Basic Operation 

On packet arrival, RED gateways [3] must execute two different algorithms: 

1. One algorithm for calculating the average queue size avg, which is performed
by using a low-pass filter with an exponential weighted moving average
method, using Wq as the weight.

2. Another one for calculating the probability of dropping a packet “early”,
i.e., before the queue overflows. The probability depends on two factors:

(a) The degree of congestion, represented by the value of avg. Early packet
drop is performed only when avg is between minth and maxth. When
avg is above maxth, all incoming packets are dropped permanently.

(b) A random factor which ranges linearly from 0 to maxp as avg ranges
from minth to maxth.

Figure 1 depicts the value of the drop probability as a function of the average 
queue size. 

The objective of the first algorithm is to detect the current degree of congestion 
while allowing a certain degree of burstiness which is determined by Wq. With the 
second algorithm, the RED gateway decides whether to drop the incoming packet 
or not. The random factor is introduced to avoid the global synchronization of 
congestion notification and to avoid a bias against bursty traffic. 
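The two algorithms described in this section can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the function names `update_avg` and `early_drop_probability` are hypothetical, and the sketch omits refinements of the full RED algorithm in [3], such as spacing drops by the number of packets since the last drop and correcting the average after idle periods.

```python
def update_avg(avg, queue_len, w_q=0.002):
    """Low-pass filter: exponentially weighted moving average of the queue size."""
    return (1.0 - w_q) * avg + w_q * queue_len


def early_drop_probability(avg, min_th, max_th, max_p):
    """Drop probability as a function of the average queue size.

    Simplified sketch (assumption): linear ramp from 0 to max_p between
    min_th and max_th, and certain drop once avg reaches max_th.
    """
    if avg < min_th:
        return 0.0
    if avg >= max_th:
        return 1.0
    return max_p * (avg - min_th) / (max_th - min_th)
```

With a small weight such as 0.002, a short burst changes the average only slightly, which is how the filter tolerates a certain degree of burstiness.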

2.2 RED and UDP-Based Traffic 

Fig. 1. The probability of early drop (drop probability as a function of the average
queue size)

It is assumed that in general UDP-based sources do not implement a congestion
avoidance mechanism to reduce their transmission rate due to packet drops.
Depending on the amount of UDP-based traffic arriving at the RED queue and
the value of maxp, the average queue size cannot be kept within the interval
between minth and maxth. For example, if a value of maxp = 0.02 is used (as
recommended for congestion notification of TCP sources [3]), a constantly arriving
amount of traffic of 102% of the bottleneck link capacity or above will keep the
average queue size at a level of around maxth. This is due to the fact that at most
2% of the arriving packets are dropped early. If maxp is set to higher
values, for example, 0.25, 0.5, or 0.75, arriving traffic of 133%,
200%, or 400% of the bottleneck link capacity or above will compensate for the early
drops. That means that in spite of the early drops, the amount of traffic trying
to cross the bottleneck link is still larger than the link capacity. Only a value of
maxp = 1 cannot be compensated for by a higher amount of arriving traffic.
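The percentages quoted above follow from a simple relation: if at most a fraction maxp of the arriving packets is dropped early, an offered load L (relative to the link capacity) still fills the link whenever L x (1 - maxp) >= 1, i.e. L >= 1 / (1 - maxp). A minimal Python sketch of this arithmetic, with the hypothetical function name `saturating_load`:

```python
def saturating_load(max_p):
    """Smallest offered load (fraction of link capacity) that still saturates
    the link when at most a fraction max_p of the arrivals is dropped early."""
    if not 0.0 <= max_p < 1.0:
        raise ValueError("max_p must be in [0, 1); max_p = 1 cannot be compensated")
    return 1.0 / (1.0 - max_p)


for p in (0.02, 0.25, 0.5, 0.75):
    print(p, round(100 * saturating_load(p)))  # 102, 133, 200, 400
```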

In all these cases, the influence of the early drop algorithm is limited since
the RED queueing discipline alternates periodically between the early-drop phase,
where packet drops are based on a random factor, and the permanent-drop
phase, in which randomization does not play any role. Therefore, effects of global
synchronization may be reintroduced, which leads to an unfair division of
bandwidth. The remainder of this paper addresses the issue of the fairness of
RED in a scenario with UDP-based CBR sources.

3 Simulation Scenario 

Simulations are performed with the network simulator (ns) [8]. Figure 2 depicts 
the simulation topology used for the following investigations. 

50 nodes send CBR traffic via UDP over a single bottleneck link with an 
RED queuing discipline to 50 receiving nodes. The delay on all links does not 
have an influence on the simulation results since UDP has no congestion control 
mechanism which depends on the round-trip time and, thus, on the delay. 

In [3] it is proposed to set the RED parameter maxth to at least twice the
value of minth and Wq to a value larger than 0.001. According to these rules, RED is
configured with minth = 30 packets, maxth = 60 packets, and Wq = 0.002.
Fig. 2. The simulation topology (labels: sending nodes, receiving nodes)



However, since there is no bursty traffic within the simulations, the setting of
these three parameters is less important. The parameter maxp is set to 0.02 (a
recommended value for TCP-based traffic sources [3]) so that the influence of the
early drop algorithm is low at first. There is no recommended value for maxp in case of
a congestion insensitive transport protocol such as UDP. The buffer size of the
bottleneck link is 180 packets so that packet drops due to overflow of the packet
queue never occur.

For all simulations, there are five different CBR traffic generators on each of
the sending nodes, which send UDP packets at 16, 20, 30, 50, and 100 kbps.
In total, 10,800 kbps arrive at the bottleneck link, corresponding
to a load of 135% of the link capacity. The packet size is constant at 125 bytes for
each data flow.

Each simulation lasts 100 seconds. The traffic generators start sending in an
arbitrary, but deterministic order (using a random number generator with the
same seed for all simulations) within the first second of all simulations. The five
generators on each node start simultaneously. The measurements start after 4 s
of simulation time so that both the actual queue size and the weighted average
queue size of RED have already reached a stable state.

4 Fairness of RED with UDP-Based CBR Traffic 

The traffic generators create data flows with a total bit rate of 10,800 kbps which
have to compete for the available 8 Mbps on the bottleneck link. Intuitively, it
is expected that each data flow gets a fair share of roughly 8,000/10,800 = 74.1%
of its source data rate. Table 1 shows the result of the first simulation run.
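The offered load and the expected fair share can be checked with a few lines of Python. The constant names are illustrative; the values (50 sending nodes, five generators per node, an 8 Mbps bottleneck) are taken from the scenario description.

```python
RATES_KBPS = (16, 20, 30, 50, 100)  # one generator of each rate per sending node
NODES = 50
LINK_KBPS = 8000                    # bottleneck link capacity (8 Mbps)

offered_kbps = NODES * sum(RATES_KBPS)   # total arriving traffic
load = offered_kbps / LINK_KBPS          # offered load relative to capacity
fair_share = LINK_KBPS / offered_kbps    # fraction each flow should get through

print(offered_kbps, round(100 * load), round(100 * fair_share, 1))  # 10800 135 74.1
```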
Table 1. Statistical simulation output for the different flows

Data rate     Mean received        Throughput   Standard            Minimum data   Maximum data
sent [kbps]   throughput [kbps]    share [%]    deviation [kbps]    rate [kbps]    rate [kbps]
16            11.7                 73.3         0.9                 9.6            13.2
20            14.5                 72.6         2.0                 10.8           18.0
30            21.9                 73.0         1.7                 18.9           25.3
50            37.6                 75.1         3.1                 31.9           42.8
100           74.8                 74.8         2.6                 68.0           78.3



The arithmetic mean of the received throughput, measured at the traffic
sinks, is close to the expected share of 74.1%. However, the throughput varies
significantly between single data flows.

For example, for the data flows from the 16 kbps traffic generators, one particular
data flow achieves a throughput of only 9.6 kbps while another data flow
achieves 13.2 kbps, 37% more than the former. This situation does not
change even if the simulation time is extended to 1,000 seconds. The variation
coefficient, which is the standard deviation σ divided by the arithmetic mean, is
about 0.03-0.14 (3-14% of the mean) for this particular simulation, too high to
be negligible. Further simulations with different randomly chosen start times for
the traffic generators show an arithmetic mean of about 10% for the variation
coefficient.
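The quoted range of the variation coefficient can be recomputed from the means and standard deviations in Table 1. The following Python sketch uses illustrative names (`TABLE_1`, `variation_coefficient`); the numbers are those of the table.

```python
# Data rate sent [kbps] -> (mean received throughput [kbps], standard deviation [kbps])
TABLE_1 = {
    16: (11.7, 0.9),
    20: (14.5, 2.0),
    30: (21.9, 1.7),
    50: (37.6, 3.1),
    100: (74.8, 2.6),
}


def variation_coefficient(mean, std_dev):
    """Standard deviation divided by the arithmetic mean."""
    return std_dev / mean


cv = {rate: variation_coefficient(m, s) for rate, (m, s) in TABLE_1.items()}
print(round(min(cv.values()), 2), round(max(cv.values()), 2))  # 0.03 0.14
```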

As a result, RED seems to be no longer fair when the average queue size
oscillates around maxth.

5 Traffic Phase Effects 

Effects of traffic phases [2] have been one of the reasons to develop queueing 
disciplines based on random drop. Since the random factor of RED has a small 
influence only in case of maxp = 0.02 and UDP-only traffic (cf. Sect. 2.2), the 
temporal interaction of the CBR traffic generators with the average queue size 
is considered a candidate to be responsible for this effect. This section performs 
an analysis of this interaction. 



5.1 Periodicity of the Average Queue Size 

Figure 3a depicts a sample of the average queue size and the current queue size. 
The y-axis shows the current number of packets in the queue and the calculated 
average queue size. Figure 3b depicts a cutout from figure 3a showing the average 
queue size only. 

Fig. 3. Sample of the average and current queue size. Panels: (3a) average and current
queue size sample; (3b) average queue size sample, cutout [9..9.015 s]; x-axis: time [s]

The average queue size oscillates around maxth (60 packets in the
simulations), as can be seen from figure 3b. This is because packets have a
certain probability of being queued if the average queue size is below maxth (the
so-called ‘early-drop phase’). In this phase, the current queue size rises. As soon
as the average queue size is above maxth, all incoming packets are dropped (the
so-called ‘permanent-drop phase’) and the current queue size becomes smaller.
Since the average queue size is calculated by a low-pass filter, it rises and falls
much more slowly than the current queue size.

The analysis of the simulation output has shown that the early-drop phase
lasts about 9.1 ms on average (with a standard deviation of 5.3 ms) and the
permanent-drop phase about 3.2 ms (with a standard deviation of 1.7 ms). Thus,
the total period of the average queue size oscillations is 9.1 + 3.2 = 12.3 ms.

5.2 Synchronization of Periods 

This section evaluates the interaction of the average queue size period with the 
periodicity of the CBR traffic generators. Table 2 shows the inter-packet time 
for the different CBR traffic generators. 



Table 2. Inter-packet times for the different CBR sources

Data rate [kbps]          16     20     30     50     100
Inter-packet time [ms]    62.5   50.0   33.3   20.0   10.0
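The inter-packet times in Table 2 follow directly from the constant packet size of 125 bytes used in the simulations. A minimal Python sketch (the function name `inter_packet_time_ms` is illustrative):

```python
PACKET_BITS = 125 * 8  # constant packet size of 125 bytes


def inter_packet_time_ms(rate_kbps):
    """Time between two packets of a CBR source: bits / (kbit/s) gives ms."""
    return PACKET_BITS / rate_kbps


for rate in (16, 20, 30, 50, 100):
    print(rate, round(inter_packet_time_ms(rate), 1))
```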



There is a good chance of a synchronization between the oscillation period
of the average queue size and the periodicity of the CBR traffic generators.
Therefore, a high variation coefficient is expected from the 16 kbps CBR sources,
since their period is nearly five times as large as the average queue size oscillation
period (5*12.3 ms = 61.5 ms vs. 62.5 ms). The same holds for the 20 kbps sources
(4*12.3 ms = 49.2 ms vs. 50 ms) and for the 100 kbps sources (4*12.3 ms = 49.2 ms
vs. 5*10 ms = 50 ms).
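The near-multiples mentioned above can be made explicit by dividing each inter-packet time by the oscillation period of 12.3 ms. The following Python sketch is only an illustration; the helper names are hypothetical, and "distance to the nearest integer" is an assumed, informal measure of how close a source is to synchronization.

```python
QUEUE_PERIOD_MS = 9.1 + 3.2  # early-drop phase plus permanent-drop phase
INTER_PACKET_MS = {16: 62.5, 20: 50.0, 30: 33.3, 50: 20.0, 100: 10.0}


def period_ratio(inter_packet_ms, queue_period_ms=QUEUE_PERIOD_MS):
    """How many queue oscillation periods fit into one inter-packet time."""
    return inter_packet_ms / queue_period_ms


def distance_to_integer(x):
    """Informal closeness measure: 0 means an exact multiple."""
    return abs(x - round(x))


for rate, ipt in INTER_PACKET_MS.items():
    ratio = period_ratio(ipt)
    print(rate, round(ratio, 2), round(distance_to_integer(ratio), 2))
```

For the 100 kbps sources the text compares five inter-packet times (50 ms) with four oscillation periods, which corresponds to `period_ratio(5 * 10.0)`.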

Table 3 presents the mean values of the variation coefficient for each class
of CBR traffic sources averaged over 15 simulation runs.

Table 3. Variation coefficients for the different CBR sources

Data rate [kbps]             16     20     30     50     100
Variation coefficient [%]    7.3    12.2   9.1    15.3   11.8

For the 20 kbps CBR sources, the mean of the variation coefficient is as high as
expected, but for the other CBR sources it is either unexpectedly high or unexpectedly
low. This might be because a perfect synchronization will not happen since the average
queue size oscillations have a high variation coefficient of more than 50%.

Therefore, theoretical predictions about which data flows become synchronized
seem to be complex. Instead, simulations should confirm that traffic phase
effects are responsible for the unfairness between the data flows.



The Role of Start Times. Synchronization effects should depend on the particular
point of time when a CBR traffic generator starts sending packets. This dependency
is analyzed in the following.

In the following set of simulations, the start of a particular data flow of the
20 kbps data source is delayed by 1 to 50 ms in 1 ms steps. To have a good chance of
a significant improvement in the achieved throughput, the data flow with the lowest
throughput (10.8 kbps) in the first simulation is chosen to be delayed. Since the
inter-packet time is about four times the average queue size oscillation period, the
achieved throughput of this data flow is expected to oscillate with a period of roughly
12 ms, leading to an expected four minima and four maxima.

Figure 4 depicts the result of the simulations. The x-axis shows the delay of 
the starting time in milliseconds. The y-axis shows the achieved throughput for 
this particular data flow in each simulation. 

As expected, the throughput improves significantly if the starting time of the
data flow is delayed. However, the expected oscillation of the achieved throughput
could not be observed. Further simulations in which other data flows were
delayed did not even show a significant change in the achieved throughput.

Figure 5 depicts a sample of the average queue size of the simulation with no 
delayed start time and the simulation with a 1 ms delay. The x-axis shows the 
simulation time whereas the y-axis shows the average queue size in packets. 
Fig. 4. Achieved throughput of a single flow with delayed starting times (panel title:
achieved throughput for a single 20 kbps data flow; x-axis: delay of the starting
time [ms])



Although the achieved throughput of the particular data flow does not differ
significantly (10,750 bps vs. 10,710 bps) between the two simulations, the average
queue sizes are completely different already after a simulation time of around 4 s.

Thus, these simulation results cannot be used to show a dependency between
the periods of the average queue size and the CBR traffic generators, since even
a delay of 1 ms changes the phase of the average queue size oscillations
significantly.

However, since randomization has been shown to avoid phase effects, the next 
section performs further simulations with randomization included at different 
places. 

6 Introducing Randomization 

Two possible sources of the phase effects can be identified: synchronization of
the five traffic sources on a single node (intra-node synchronization) or
synchronization of the traffic sources on different nodes (inter-node synchronization).
To examine the effect of the former, the first simulation was repeated with random
starting times for each CBR traffic generator. As a result (cf. Table 4), the
variation coefficient is slightly lower (3-11%).

Hence, the intra-node synchronization has an influence on the traffic phase
effects. However, since the variation coefficient remains high, the intra-node
synchronization cannot be the only source of the traffic phase effects.

Thus, randomization was introduced at two further places (while intra-node
synchronization was reintroduced):

1. At the traffic sources 

2. At the RED gateway 
Fig. 5. Sample of the average queue size for the simulations with no delay and with a
1 ms delay



Table 4. Statistical simulation output: Fully randomized start times

Data rate     Mean received        Throughput   Standard            Minimum data   Maximum data
sent [kbps]   throughput [kbps]    share [%]    deviation [kbps]    rate [kbps]    rate [kbps]
16            11.5                 71.8         1.2                 8.3            13.7
20            14.4                 72.1         1.6                 11.9           18.0
30            22.1                 73.8         1.8                 19.2           25.4
50            37.3                 74.6         2.5                 30.7           40.8
100           75.0                 75.0         2.1                 69.5           78.2



6.1 Randomization at the Traffic Sources 

First, the CBR sources were modified to send with a uniformly distributed
inter-packet time (±10 ms). As a result, the variation coefficient falls below 5% of
the mean value (cf. Table 5) although intra-node synchronization was reintroduced.

Thus, the fairness of dividing the available bottleneck link capacity among
the CBR flows is improved significantly. With a larger interval for the inter-packet
time (±100 ms), the variation coefficient falls below 1.6%. However, it
will not always be possible to change all CBR traffic sources in practice.
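One way to read "uniformly distributed inter-packet time (±10 ms)" is that each gap equals the nominal CBR interval plus a uniform jitter. The Python sketch below illustrates that reading only; it is an assumption, not the authors' ns configuration, and the function name `jittered_send_times`, the clamping of negative gaps to zero and the fixed seed are all illustrative choices.

```python
import random


def jittered_send_times(nominal_ms, jitter_ms, count, seed=0):
    """Send times of a CBR source whose gaps are nominal_ms +/- uniform jitter.

    Assumed interpretation of the randomized sources; gaps are clamped at
    zero so that a packet is never scheduled in the past.
    """
    rng = random.Random(seed)
    now = 0.0
    times = []
    for _ in range(count):
        gap = nominal_ms + rng.uniform(-jitter_ms, jitter_ms)
        now += max(gap, 0.0)
        times.append(now)
    return times


# 16 kbps source: nominal inter-packet time of 62.5 ms, jitter of +/- 10 ms
times = jittered_send_times(62.5, 10.0, 1000)
```

Because the jitter is symmetric, the mean sending rate stays close to the nominal rate while the phase relative to the queue oscillation is no longer fixed.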

In a second step, the traffic generators were changed to be switched on and off
dynamically during the simulations (so-called “Exponential On/Off” traffic generators).
The on-time for each traffic generator was drawn from an exponentially distributed
random variable with a mean of 5, 25, 33, 50, 75, 100, and 125 seconds. During the
on-time, the traffic generators send traffic with the same constant bit rate as in the
previous simulations. The off-time was likewise drawn from an exponentially
distributed random variable, with a mean of 650 milliseconds. Figure 6 shows five
graphs of the variation coefficient for the five different classes of traffic.
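The on/off behavior described above can be sketched by drawing alternating on- and off-durations from exponential distributions. This Python sketch is an illustration under assumptions: the function name `on_off_schedule` is hypothetical, each source is assumed to start in the on-state, and the simulator's own generator may differ in detail.

```python
import random


def on_off_schedule(mean_on_s, mean_off_s, duration_s, seed=0):
    """Return (start, end) pairs of the on-periods within the simulated duration.

    Assumptions: the source starts in the on-state, and on/off durations are
    independent exponential random variables with the given means.
    """
    rng = random.Random(seed)
    now = 0.0
    periods = []
    while now < duration_s:
        on = rng.expovariate(1.0 / mean_on_s)
        periods.append((now, min(now + on, duration_s)))
        now += on + rng.expovariate(1.0 / mean_off_s)
    return periods


# Mean on-time of 5 s, mean off-time of 650 ms, 100 s of simulated time
periods = on_off_schedule(5.0, 0.65, 100.0)
```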
Table 5. Statistical output: Randomized inter-packet times at the CBR sources

Data rate     Mean received        Throughput   Standard            Minimum data   Maximum data
sent [kbps]   throughput [kbps]    share [%]    deviation [kbps]    rate [kbps]    rate [kbps]
16            11.8                 73.5         0.3                 11.2           12.5
20            14.6                 73.2         0.4                 13.7           15.6
30            21.9                 73.1         1.1                 20.2           24.1
50            37.3                 74.5         1.7                 33.9           40.7
100           74.6                 74.6         1.6                 70.9           77.6



Fig. 6. Variation coefficient for the exponential on/off traffic generators (x-axis:
on-time)



On the x-axis, the mean of the on-time is depicted. The y-axis shows the
resulting variation coefficient for each of the five different classes of traffic. The
variation coefficient rises with the length of the on-time. For the longest on-time,
the variation coefficient is as high as in the CBR traffic generator scenario. Thus,
Exponential On/Off traffic generators introduce the randomization needed to
avoid the traffic phase effects, provided the on-times are short enough.

6.2 Randomization at the RED Gateway 

One possibility to introduce randomization at the RED gateway is not to drop
the arriving packet, but to drop a randomly chosen packet from the queue. This
way, the standard deviation can be reduced, but only to slightly below 10% of the
mean value (cf. Table 6). This randomization leads to only a small improvement
of the fairness. The introduction of randomization at the traffic sources
performs much better. Dropping an arbitrary packet from the packet queue is
easier than changing all CBR traffic sources since changes are necessary at only a
single place in the network. However, dropping a packet from the middle
of the queue is difficult to implement efficiently. Therefore, an efficient solution
to introduce randomization, which is also easy to implement, is still missing.

Table 6. Statistical simulation output for randomized drop

Data rate     Mean received        Throughput   Standard            Minimum data   Maximum data
sent [kbps]   throughput [kbps]    share [%]    deviation [kbps]    rate [kbps]    rate [kbps]
16            11.8                 73.8         0.5                 10.8           12.7
20            14.4                 72.0         1.4                 12.0           16.9
30            22.0                 73.3         1.1                 20.0           24.6
50            37.1                 74.2         1.7                 34.3           40.1
100           74.7                 74.7         0.9                 72.5           76.2
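Dropping a randomly chosen packet from the queue, instead of the arriving one, can be sketched as follows. This is a hypothetical illustration on a plain Python list, not code from the simulator: removing an element from the middle while keeping FIFO order requires shifting the later elements, which hints at why the text calls an efficient implementation difficult.

```python
import random


def drop_random_packet(queue, rng=random):
    """Remove and return a uniformly chosen packet from a non-empty list queue.

    Illustrative only: list.pop(index) shifts all later elements, so the
    cost grows with the queue length.
    """
    index = rng.randrange(len(queue))
    return queue.pop(index)


queue = list(range(10))  # packets 0..9 in arrival order
victim = drop_random_packet(queue, random.Random(1))
```

The remaining packets keep their arrival order, so the FIFO service discipline itself is unchanged.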

Another possibility to introduce randomization at the RED gateway is to
increase maxp so that there is a higher influence of the early-drop phase
(cf. Sect. 2.2). Figure 7 depicts how the variation coefficient changes for each
class of data flows when maxp is varied.



Fig. 7. Variation coefficient for simulations with different maxp values. Panels:
variation coefficient for CBR traffic; variation coefficient for CBR traffic, cutout
[0.25:0.4]; x-axis: max_p



As long as maxp is small enough (below 0.25) so that the early drops cannot
prevent the link from being overloaded, the variation coefficient remains high.
If maxp becomes sufficiently large (above 0.36) so that the load of the link
falls below 100%, the traffic phase effects disappear. Between these two values,
the variation coefficient quickly becomes smaller. Thus, a sufficient amount of
randomization to improve the fairness situation may be introduced by setting
maxp to a sufficiently high value, depending on the load on the bottleneck link.
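As a rough cross-check of the lower boundary, the smallest drop fraction that brings the accepted traffic down to the link rate can be computed from the offered load of 10,800 kbps and the 8 Mbps bottleneck. The Python sketch below (the function name `min_drop_fraction` is hypothetical) gives only a necessary lower bound on maxp; the text reports that the phase effects disappear only above 0.36.

```python
def min_drop_fraction(offered_kbps, link_kbps):
    """Smallest fraction of arrivals that must be dropped so that the
    accepted traffic no longer exceeds the link rate."""
    return max(0.0, 1.0 - link_kbps / offered_kbps)


print(round(min_drop_fraction(10800, 8000), 3))  # 0.259
```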

In summary, this section has led to the strong presumption that the unfairness
of RED with regard to CBR data flows is caused by traffic phase effects.
Randomization at the traffic sources (either for CBR or for Exponential On/Off
traffic generators) improves the fairness, as does tuning the maxp parameter
at the RED gateways.

7 Summary 

In this paper, the problems of UDP-based CBR traffic sources in conjunction
with RED gateways have been analyzed briefly. Traffic phase effects can lead
to an unfair division of bandwidth among CBR data flows. Fairness improves when
randomization is introduced either at the traffic sources, so that they no longer
send strictly CBR traffic, or at the RED gateway.

Further work will be done evaluating the Generalized RED queueing disci- 
pline in our simulation to enable the usage of service differentiation by drop 
priorities. Additionally, the traffic model will be extended to variable bit rate 
traffic, e.g., as produced by video coders. 
Abstract. To overcome IP packet loss for continuous media applications,
efficient error control mechanisms are required that take into account
real-time constraints while trading off additional bandwidth and
application level service quality. For multicast error control, heterogeneity
caused by different receiver loss resiliency characteristics must be
addressed as well. Related work has shown the advantages of hybrid
error control that combines Forward Error Correction (FEC) with
retransmission of parity packets (ARQ). We present a novel scheme for
dynamically adapting hybrid error control to changing network conditions
that minimizes the impact of packet loss on the received media
quality. To adjust the amount of redundancy, an optimization mechanism is
applied that uses a set of equations reflecting network
and media characteristics. Analytical and simulation results
are presented demonstrating the usefulness of the new scheme.



1 Introduction 

In today’s Internet there is an increasing amount of real-time multicast traffic for 
applications like teleconferencing, video distribution and distributed games. As 
IP services with QoS support (IntServ [3] and DiffServ [1]) are frequently either
unavailable or costly, error control schemes that allow these applications to be
used over best-effort IP services are important. Recent work [8] has demonstrated the
advantages of retransmitting parity packets for efficient multicast error control.
It has also been shown [10] that proactive FEC can considerably reduce the
number of retransmissions required by real-time applications.
However, good solutions for adjusting the right amount of redundancy
in changing network conditions are still missing: too little redundancy may result
in too many retransmission rounds being needed for loss recovery, while too
much redundancy increases the loss rate of all flows over a common bottleneck.
We propose an error control scheme for real-time multicast that adjusts
and optimizes the number of parity packets for the initial transmission as well
as for subsequent retransmission rounds. A set of equations for the adjustment 
of the amount of redundancy is presented that describes the adaptation to net- 
work and application characteristics. The remaining sections are organized as 
follows: in section 2, the new scheme is introduced together with related work. 
Simulation results are presented in section 3. Section 4 concludes the paper.

2 Real-Time Reliable Multicast Error Control

The majority of work on multicast error control focuses on non-real-time ser- 
vices. Huitema [6] studied the benefits of FEC when a separate retransmission 
scheme is used on top of the FEC layer. Nonnenmacher et al. compared a num- 
ber of FEC/ARQ schemes in [8] and showed that an integrated scheme with 
ARQ of parities achieves the lowest network overhead since a fixed set of repair
packets can recover a large variety of losses. Rubenstein et al. presented a repair
technique called proactive FEC [10], which decreases the expected time
for reliable receipt of data while achieving a required probability of successful
delivery. Receivers request more repair packets than they need in order to achieve
a certain probability of a successful last retransmission or a certain probability
of successful reception within a maximum number of retransmission rounds. The
drawback of this protocol is that it tries to fulfill all requests of the receivers
at the cost of bandwidth, which is not suitable for congested networks. Furthermore,
the selection of an appropriate amount of proactive FEC is not addressed.
Kermode introduced a hierarchical protocol (SHARQFEC [7]) with the ability of 
localizing repair traffic and the selective addition of FEC. Though SHARQFEC 
is by nature a protocol for reliable multicast it introduces features which are 
also suitable for real-time multicast. The protocol uses a hierarchy of adminis- 
tratively scoped nested regions to restrict the range of repair traffic. Receivers 
are combined into repair zones based on subtrees. Each zone has a zone closest
receiver (ZCR). For every repair zone the maximum local loss count defines the
amount of redundancy added to this region by the ZCR. FEC can be selectively 
used in regions with higher losses. The amount of redundancy added for every 
region should depend also on the average loss and not only on the maximum 
loss rate like in SHARQFEC to improve the use of bandwidth. As regions may 
have largely varying loss rates to different receivers, adjusting the amount of 
redundancy to the receiver with the highest losses may be highly inefficient. 

2.1 New Scheme: Adaptive Real-Time Scoped Hybrid ARQ FEC 

In the following we present a new scheme for adaptive hybrid error control for 
real-time services. The scheme has been designed as an extension of SHARQ- 
FEC. The main differences to SHARQFEC are the support for real-time flows 
and the improved scheme to adjust proactive redundancy dynamically. Key features
of the scheme are the following: Feedback consists of NAKs, information
on the service quality ("service value") and the network conditions (like RTT
and loss rate) of individual receivers. Both senders and receivers control the
amount of proactive redundancy using service value functions. These functions
make it possible to adapt the error control scheme to the needs of the receivers that
value this most, as well as to the media characteristics and current network
conditions. The scheme takes the negative impact of additional traffic on the loss
rate into account. This is performed by adaptivity policies, which also reflect the
network load conditions. The scheme complements TCP-friendly multicast rate
adaptation by indicating which part of the consumed bandwidth would be used
for FEC.

General Assumptions. After joining the multicast session, each participant
receives the real-time flow via RTP/UDP on top of IP multicast from the
sender. ADUs (Application Data Units) are transmitted as transmission groups,
each combined with encoded proactive FEC (parity) packets into a so-called block
(we employed Reed-Solomon Erasure Codes [9]). If the receiver does not get
sufficient packets to decode the transmission group out of the received block, it
sends a NAK with the number of additional repair packets needed. This procedure
is repeated until the receiver has sufficient packets for decoding or until
the play-out point for the ADU has been reached. In addition to this procedure, every
receiver has to generate its individual service value to evaluate the current quality
of service by applying a user-specified service value function. Every receiver
periodically sends the service value, the play-out delay, the loss characteristics of 
the received packets and the RTT to the sender. Feedback of the receivers is used 
in two ways: a) to decide if the current service quality satisfies the one required 
by the receivers and b) to generate information necessary for optimization of 
the error correction while maintaining a high service quality. To decide if the 
parameters of the transmission should be modified, the sender uses the service 
value policy. This policy defines how the sender deals with unsatisfied receivers. 
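
The receiver-side NAK procedure described above can be sketched as follows. This is a minimal illustration in Python, not the authors' implementation: the function name `repair_packets_to_request` and its parameters are hypothetical. It relies on the property of erasure codes such as Reed-Solomon that any k packets of a block suffice to reconstruct the k data packets.

```python
def repair_packets_to_request(k, received, now_ms, playout_ms):
    """Return how many repair packets a receiver should ask for in its NAK.

    k           -- number of data packets in the transmission group
    received    -- distinct packets (data or parity) received for this block
    now_ms      -- current time in milliseconds (hypothetical clock)
    playout_ms  -- play-out point of the ADU in milliseconds
    A return value of 0 means that no NAK is sent.
    """
    if received >= k:
        return 0              # enough packets: the block can be decoded
    if now_ms >= playout_ms:
        return 0              # play-out point reached: a repair would arrive too late
    return k - received       # number of additional packets still needed


# Illustrative values only: 8 data packets, 6 packets arrived so far.
print(repair_packets_to_request(k=8, received=6, now_ms=100, playout_ms=550))
```
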

Service Value. For properly adjusting the error control scheme to applica- 
tion requirements and network conditions, the impact of packet losses on quality 
of continuous media services is important. For such an assessment we define the 
service value to be a number between 1 (lowest possible service value) and 5 
(highest possible service value) like MOS (Mean Opinion Score) results of sub- 
jective tests [4]. 

Service Value Policy. The service value policy is used by the sender to 
decide if it should change the parameters of the transmission to increase or 
to reduce the quality at the receivers by influencing the loss rate. One has to
keep in mind that a single bad receiver can influence a whole group of good receivers
if the service policy demands that all service values are above a certain level.
Therefore, we decided to keep the deviation from the maximum value below a
certain value for the whole group. For this policy the sender only needs
the information of the unsatisfied receivers; for all other receivers it assumes
that they are satisfied. This fits well with the scalability of our feedback
scheme, which depends on the group size and on the constraint
that discontented receivers issue session messages much more often than satisfied
ones.
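
One possible reading of this policy is sketched below in Python. It is an assumption-laden illustration: the text does not specify how the group deviation is aggregated, so the sketch uses the mean deviation from the maximum service value and counts every receiver that did not report as having no deviation. The names `needs_adjustment` and `max_mean_deviation` are hypothetical.

```python
SV_MAX = 5.0  # highest possible service value


def needs_adjustment(unsatisfied_svs, group_size, max_mean_deviation):
    """Decide whether the sender should change its transmission parameters.

    unsatisfied_svs     -- service values reported by unsatisfied receivers
    group_size          -- total number of receivers in the group
    max_mean_deviation  -- tolerated mean deviation from SV_MAX (assumed metric)
    """
    # Receivers that did not report are assumed to be satisfied (deviation 0).
    total_deviation = sum(SV_MAX - sv for sv in unsatisfied_svs)
    return total_deviation / group_size > max_mean_deviation


# Illustrative values: one bad receiver in a group of ten is tolerated,
# two bad receivers in a group of four are not.
print(needs_adjustment([2.0], group_size=10, max_mean_deviation=0.5))
print(needs_adjustment([2.0, 1.0], group_size=4, max_mean_deviation=0.5))
```
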

Error Control. Improvement of error control is one way to increase the 
quality of real-time traffic. The main goal is to reduce the ADU loss rate at 
the receivers. Error control should repair as many losses as needed to
keep the service quality high. Two tasks are to be fulfilled: the first is to find all
possible solutions that achieve a certain service quality for the receivers
and the second is to choose the most effective solution that realizes this.

Bandwidth Adjustment. The ADU loss rate can be reduced by
enlarging the bandwidth for error correction. There are three possibilities to obtain
this additional bandwidth: adding it to the current bandwidth consumed, reduc- 
ing the bandwidth used for the transmission of data or applying a combination of 
both. Since the number of available retransmission rounds is limited the only way 
to further enlarge the bandwidth for error correction is to increase the fraction 
of FEC (i.e. the amount of proactive redundancy). Many protocols and schemes 
exist that add proactive redundancy to the initial transmission ([7], [10]). However,
since the maximum consumable bandwidth is often limited, such a limit
should be determined by a separate mechanism, which we do not consider further.
We concentrate on adjusting a suitable ratio of bandwidth between
data transmission and error correction.



2.2 Service Value Equation 



For the calculation of the service value equation we assume a star topology with 
randomly distributed losses on the links described by the loss probability p. The 
following variables are used: the number of data packets k, the number of parity 
packets h and the number of packets of the block n with n = k + h. 

Retransmission Rounds. First the maximum number of retransmission 
rounds for which proactive redundancy could be adjusted has to be calculated. 
It is better to limit the number of retransmission rounds with adjustment of 
proactive redundancy depending on the characteristics of all receivers (especially 
if retransmissions are sent by multicast to all receivers). Otherwise, receivers with 
short RTTs would be favoured and moreover, the paths of high RTT receivers 
burdened. The number of available retransmission rounds with an adjustment of 
proactive redundancy (rr) should be calculated by dividing the mean playout- 
delay by the maximum RTT of all receivers. The complete transmission of one 
block consists of the initial transmission and up to rr retransmission rounds 
with adjustment of proactive redundancy. We number the rounds starting with 
round 0. In this round the initial transmission takes place and the receivers send 
their first NAKs. In each round (r > 0) the sender reacts to the feedback from 
the previous round. It issues as many repair packets as requested and additional 
repair packets equal to the current amount of proactive redundancy adjusted for 
this retransmission round. 
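
The calculation of rr can be sketched in a few lines of Python. The function name and the input values are illustrative, and rounding down is an assumption (a partially available round cannot be completed before the play-out point); the text only states that the mean play-out delay is divided by the maximum RTT.

```python
import math


def retransmission_rounds(playout_delays_ms, rtts_ms):
    """Number of retransmission rounds rr with adjustable proactive redundancy."""
    if not playout_delays_ms or not rtts_ms:
        raise ValueError("need at least one receiver report")
    mean_playout = sum(playout_delays_ms) / len(playout_delays_ms)
    max_rtt = max(rtts_ms)
    # Assumption: only complete rounds count, so the quotient is rounded down.
    return math.floor(mean_playout / max_rtt)


# Illustrative values: two receivers, 400 ms play-out delay, RTTs of 80 and 120 ms.
rr = retransmission_rounds([400, 400], [80, 120])
print(rr)
```

Using the maximum RTT means that the slowest receiver determines rr, which matches the stated goal of not favouring receivers with short RTTs.
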

The equation of the service value is composed of the following set of equations 
that reflect the influence of network conditions and media characteristics: 



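$$P\{round_r\} = \sum_{i=k_r}^{n_r} \binom{n_r}{i} (1-p)^{i}\, p^{\,n_r-i} \quad \text{with } n_r = k_r + h_r,\ 0 \le r \le rr \qquad (1)$$

$$P\{transm.\} = 1 - \left(1 - P\{round_0\}\right)\left(1 - P\{round_1\}\right) \cdots \left(1 - P\{round_{rr}\}\right) \qquad (2)$$

$$SV_{bw} = c_{dist}\, SV_{dist}(P\{transm.\}) + c_{noise}\, SV_{noise} \quad \text{with } c_{dist} + c_{noise} = 1 \qquad (3)$$

$$SV_{gaps} = c_1\, SV_{bw}(1) + c_2\, SV_{bw}(2) + \cdots \quad \text{with } \sum_i c_i = 1 \qquad (4)$$

$$SV_{types} = c_A\, SV_{gaps}(A) + c_B\, SV_{gaps}(B) + \cdots \quad \text{with } c_A + c_B + \cdots = 1 \qquad (5)$$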

Equation 1 is the fundamental equation for the service value: it gives the
probability of a successful initial transmission or retransmission in
any round. For each retransmission round the mean number
of requested repair packets can be calculated ([12]) according
to the work of Rubenstein [10] and Nonnenmacher [8]. To compute the requested
repair packets for the first retransmission round we use the variables $k_0 = k$
and $h_0 = h$. The calculated number equals the mean number of repair packets
the sender has to issue in the first retransmission round and is defined as $k_1$. The
number of proactive redundancy packets possibly added to the first retransmission
is defined as $h_1$. With these two variables the sender can compute the mean
number of requested repair packets for the second retransmission round, which
is defined as $k_2$. By repeating this procedure the sender can calculate the numbers
of requested repair packets for all retransmission rounds up to round $rr$. Taking
the probabilities of a successful reception for all phases of the transmission of
one block, we get equation 2 for the probability of success of the complete
transmission. Consequently, the ADU loss probability $p_{ADU}$ can be estimated
by $p_{ADU} = 1 - P\{transm.\}$. For every combination of $h_0, h_1, \ldots, h_{rr}$ a different
equation is built, because if $h_r$ increases, the number of requested repair
packets for the following retransmission round, $k_{r+1}$, decreases and vice versa.
Equation 3 represents the impact of bandwidth utilization. The more bandwidth
is allocated for error correction, the smaller the ADU loss rate ($p_{ADU}$) is. We
defined this service value as $SV_{dist}$ since lost ADUs are typically perceived as
temporary "distortion". However, since the maximum available bandwidth is often
limited, the bandwidth for data transmission is reduced at the same time. Obviously, less
data bandwidth will reduce the perceived quality through continuous "noise". Therefore,
we defined this service value as $SV_{noise}$. The service value for bandwidth
should be a combination of these two components. The components are both
weighted with a factor representing their relative importance. For instance, using
waveform-coded audio, lost ADUs lead to distortions caused by signal interruptions
[11,2], while the available data bandwidth is perceived through the amount of
quantization noise (assuming that the data bandwidth is varied by adjusting the
bit resolution of the audio signal [14]). The influence of the loss distribution is
captured by equation 4. While some single missing ADUs can be tolerated up
to a certain degree, a gap of consecutive ADUs typically leads to a higher loss
impact (assuming the same mean loss rate for both cases), depending on the coding
scheme. Therefore, proactive redundancy should be adjusted over successive
transmission groups to be able to avoid this. Finally, equation 5 can be applied
if there are ADU types with different importance for the service quality (like
different frame types in video). The specific utility functions (for instance for loss
gaps or frame types) are included as expressions for $SV_{dist}$ in the respective SV
equation for the corresponding loss gap or frame type. Table 1 shows the ranges of
the different equations together with the parameters that influence
them.

Table 1. Ranges for the service value equations

equation for    range                 influenced by
---------------------------------------------------
P{round}        one round             mean loss
P{transm.}      transmission group    RTT
SV_bw           one ADU               bandwidth
SV_gaps         consecutive ADUs      mean loss gap
SV_types        different ADUs        media
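
Equations 1 and 2 can be evaluated numerically with a short Python sketch. It assumes independent packet losses with probability p, so that a round succeeds when at least $k_r$ of its $k_r + h_r$ packets arrive; the per-round values $k_r$ and $h_r$ are taken as given inputs, because the procedure for deriving $k_{r+1}$ is described in [12] and not reproduced here. The function names and the example numbers are illustrative.

```python
from math import comb


def p_round(k_r, h_r, p):
    """P{round_r}: at least k_r of n_r = k_r + h_r packets arrive (loss prob. p)."""
    n_r = k_r + h_r
    return sum(
        comb(n_r, i) * (1 - p) ** i * p ** (n_r - i)
        for i in range(k_r, n_r + 1)
    )


def p_transmission(ks, hs, p):
    """P{transm.}: the block is recovered in at least one of the rounds."""
    all_rounds_fail = 1.0
    for k_r, h_r in zip(ks, hs):
        all_rounds_fail *= 1.0 - p_round(k_r, h_r, p)
    return 1.0 - all_rounds_fail


# Illustrative inputs: round 0 sends 8 data + 2 parity packets,
# round 1 sends 2 requested + 1 proactive repair packet, loss rate 5 %.
ks, hs, p = [8, 2], [2, 1], 0.05
p_adu = 1.0 - p_transmission(ks, hs, p)   # estimated ADU loss probability
print(p_transmission(ks, hs, p), p_adu)
```
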

With this set of equations, a single equation for the service value can be gen- 
erated for specific network conditions and media characteristics. By using this 
equation for the adjustment and the optimization of the amount and distribu- 
tion of the proactive redundancy, the error control can be adapted to special 
conditions and thus can work in an improved way for the service. 

Service Value Example. For illustration of the new measure we give a sim- 
ple example. We assume that bandwidth limitations allow five packets (which 
consist of either data or redundancy) to be used for transmitting each ADU of 
a continuous media application. The quantity of redundancy packets determines 
the ability for error correction. In our example the highest possible data bandwidth
is five, i.e. all five packets can consist of data. If data packets are replaced
by redundancy packets, the bandwidth consumed for data
decreases on the one hand, but on the other hand the ADU loss rate is reduced as well.

With no losses the service value is very high if little or no redundancy is 
used. In this case the service value heavily decreases with higher loss rates. If 
more redundancy is used the service value is lower for low loss rates but remains 
unaffected for higher losses. We obtain the following formula for representing the 
tradeoff: 

$$SV = c_{dist}\, SV_{dist}(p_{ADU}) + c_{noise}\, SV_{noise}\!\left(\frac{bw_{data}}{bw_{data_{max}}},\ 1 - p_{ADU}\right)$$

The service value with equal factors $c$ can be seen in figure 1. By modifying the
factors one can alter the graphs in the following way: if $c_{dist} > c_{noise}$,
the graphs move upwards, so more redundancy is chosen
already at lower loss rates and the service becomes more reliable. If, on the other
hand, $c_{dist} < c_{noise}$, they move downwards, so more redundancy (and thus fewer
data packets) is chosen only at higher loss rates; the service tries to
keep a high quality but with some interruptions.
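
The trade-off in this example can be explored with the following Python sketch. The two utility functions are hypothetical stand-ins (simple linear mappings onto the 1 to 5 scale), not the functions used by the authors, and retransmissions are ignored; only the five-packet budget and the weights $c_{dist}$ and $c_{noise}$ come from the example.

```python
from math import comb

TOTAL_PACKETS = 5  # packets available per ADU, as in the example


def adu_loss(k, p, n=TOTAL_PACKETS):
    """p_ADU without retransmissions: fewer than k of n packets arrive."""
    decodable = sum(
        comb(n, i) * (1 - p) ** i * p ** (n - i) for i in range(k, n + 1)
    )
    return 1.0 - decodable


def service_value(k, p, c_dist=0.5, c_noise=0.5):
    """Weighted service value for k data packets and 5 - k parity packets."""
    # Hypothetical linear utilities, both scaled to the range 1..5.
    sv_dist = 1 + 4 * (1 - adu_loss(k, p))   # fewer lost ADUs: less distortion
    sv_noise = 1 + 4 * (k / TOTAL_PACKETS)   # more data packets: less noise
    return c_dist * sv_dist + c_noise * sv_noise


def best_data_packets(p, **weights):
    """Number of data packets k (1..5) that maximizes the service value."""
    return max(
        range(1, TOTAL_PACKETS + 1),
        key=lambda k: service_value(k, p, **weights),
    )


for loss in (0.0, 0.1, 0.3):
    print(loss, best_data_packets(loss))
```

Under these assumed utilities the sketch shows the qualitative behaviour described above: without losses all five packets carry data, while at higher loss rates part of the budget is better spent on redundancy.
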

Fig. 1. SV for different redundancies and packet loss rates with $c_{dist} = c_{noise} = 0.5$

2.3 Optimization of Proactivity

Since there are many possibilities to change the distribution of the proactive
redundancy over subsequent transmission rounds, a mechanism is needed to choose the
most suitable solution. Such a mechanism has to calculate the probability of
success (service value) for all possible amounts and distributions of the proactive
redundancy by varying $h_i\ \forall\, i \in [0, rr]$. A limit has to be defined for the
maximum amount of proactive redundancy for each round. We limited $h_i$ to the
same value as $k_i$, that is, 100% proactive redundancy for each round. To be
able to choose the most suitable distribution of proactive redundancy the mechanism
needs optimization criteria. A simple but very effective criterion is the
bandwidth consumed. For its estimation we used the case of multicast retransmissions.
While $k_0$ and $h_0$ are always sent, $k_1$ and $h_1$ are issued only with the
probability $1 - P\{round_0\}$ (i.e. the probability of a failed first transmission),
and so on. Therefore, the average number of packets necessary for one transmission
group can be calculated as follows (an example can be found in [12]):

$$packets = k_0 + h_0 + \sum_{i=1}^{rr} (k_i + h_i) \prod_{j=0}^{i-1} \left(1 - P\{round_j\}\right)$$

The bandwidth consumed for the transmission of data is expressed by the variable
$k_0$. The other part of this equation expresses the bandwidth used for error
correction, with all $k_i$ ($i \ge 1$) representing the portion of reactive redundancy
(ARQ) and all $h_i$ ($i \ge 0$) representing the portion of proactive redundancy
(FEC). A new adjustment of proactive redundancy takes place if the service
policy demands an improvement of the service value. The goal is to reduce the
loss rate of the ADUs at the receivers. For that it is necessary to increase the
bandwidth for the error correction. Obviously, the more proactive redundancy is
added, the higher the probability of success and thus the lower the ADU loss rate.
Nevertheless, with larger amounts of proactive redundancy (especially in the initial
transmission) the difference in the resulting probability of success becomes
smaller and the cost (in terms of bandwidth consumption) higher. Therefore,
the gain of a successful transmission always has to be considered in connection
with the necessary bandwidth. One can distinguish the following principles for
an optimization (figure 2), which is to find one of the following:

a) the smallest bandwidth to achieve a constant service value 

b) the highest ratio of the service value and the consumed bandwidth 

c) the highest service value with a constant bandwidth 
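
The expected packet count and the three principles can be sketched in Python as follows. This is an illustration under stated assumptions: principle a) is read as "smallest bandwidth that still reaches a target service value" and principle c) as "highest service value within a bandwidth limit". The candidate tuples, function names and numbers are hypothetical, and the service values are taken as precomputed inputs.

```python
def expected_packets(ks, hs, p_rounds):
    """Average packets per transmission group (multicast retransmissions).

    ks, hs    -- requested and proactive packets per round (index 0 = initial)
    p_rounds  -- success probability P{round_j} of each round
    """
    total = ks[0] + hs[0]          # the initial transmission is always sent
    reach = 1.0                    # probability that round i is needed at all
    for i in range(1, len(ks)):
        reach *= 1.0 - p_rounds[i - 1]
        total += (ks[i] + hs[i]) * reach
    return total


def choose(candidates, principle, sv_target=None, bw_limit=None):
    """Pick one (service_value, bandwidth, label) tuple by principle a, b or c."""
    if principle == "a":   # smallest bandwidth that reaches the target SV
        ok = [c for c in candidates if c[0] >= sv_target]
        return min(ok, key=lambda c: c[1]) if ok else None
    if principle == "b":   # highest ratio of service value to bandwidth
        return max(candidates, key=lambda c: c[0] / c[1])
    if principle == "c":   # highest service value within the bandwidth limit
        ok = [c for c in candidates if c[1] <= bw_limit]
        return max(ok, key=lambda c: c[0]) if ok else None
    raise ValueError("principle must be 'a', 'b' or 'c'")


# Hypothetical candidates: (service value, expected packets, description).
candidates = [(4.0, 10.0, "h0=2"), (4.5, 12.0, "h0=4"), (4.8, 16.0, "h0=8")]
print(choose(candidates, "a", sv_target=4.4))
print(choose(candidates, "b"))
print(choose(candidates, "c", bw_limit=12.0))
```
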



[Figure: three panels, a) SV = const., b) SV/bw and c) bw, each showing the split of the bandwidth between DATA and EC]

Fig. 2. The principles for optimization regarding the bandwidth



The first principle should be used if the network load is low (and thus error con- 
trol is used to ensure very low ADU loss rates) and the bandwidth enlargement 
leads to almost no increase of the packet loss rate. In such a situation the main 
target is to keep the service value as constant as possible at the demanded high 
value. If the network load is neither very high nor very low an increase of band- 
width for the error correction should depend on the gain in terms of achieved 
service quality. Therefore, the service quality should be related to the bandwidth 
consumed for it, which is represented by principle b). Principle c) should be used
in cases of high network load, where a further enlargement of the bandwidth leads
with high probability to an increase of the packet loss rate. This reduces the
service quality not only for the receivers but also for the other traffic sharing the
same links. Therefore, the bandwidth should be kept constant and might even
be reduced by a bandwidth adaptation mechanism (cf. TCP-friendly adaptation
schemes like in [13]). Furthermore, flows that do not reduce their bandwidth
might be penalized by routers as well [5], so that the service quality of the receivers
decreases further. The decision which of these three principles to choose depends
on the traffic mix and on the bandwidth used relative to the bottleneck
bandwidth.

3 Simulation

The simulations were carried out using the network simulator ns-2 [15]. We set 
most of the parameters as in [7], using a simple topology with four receivers. 
The sender emitted 200 ADUs, each consisting of 8 data packets, at a rate 
of 800 kbit/s. We wanted to test the schemes with a heterogeneous scenario 
and therefore, we set the loss rates of each link to either 0.05 or 0.2 and the 
RTTs to either 60 or 100 msec. The play-out delay was adjusted to 550 msec. 
Since only the real-time flow was simulated, a loss model for every link has 
been used to cause losses in the flow. We used two measures to be able to 
compare the performance of the different schemes: the mean deviation and the 
mean quadratic deviation of the service value from the maximum (which is 
always 5). The quality is optimal if all receivers have a service value of five 
over the whole simulation time. Therefore, the difference between the current 
service value and the optimum value, respectively its square, was added for 
each receiver and divided by the time. Finally, the mean value of all receivers 
was calculated. Furthermore, we looked at the amount of issued packets (data 
and repair packets) to be able to compare the effort to realize this deviation. 
Since this quantity fluctuates strongly, we averaged it over the last 20
transmission groups.
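
The two deviation measures and the smoothing of the packet counts can be sketched in Python. The sketch assumes that each receiver's service value is sampled at equally spaced points in time, so that "dividing by the time" becomes dividing by the number of samples; the function names and sample data are illustrative.

```python
SV_MAX = 5.0  # optimum service value


def deviations(sv_samples_per_receiver):
    """Return (mean deviation, mean quadratic deviation) from SV_MAX.

    sv_samples_per_receiver -- one list of sampled service values per receiver
    """
    linear, quadratic = [], []
    for samples in sv_samples_per_receiver:
        diffs = [SV_MAX - sv for sv in samples]
        linear.append(sum(diffs) / len(diffs))
        quadratic.append(sum(d * d for d in diffs) / len(diffs))
    n = len(sv_samples_per_receiver)
    return sum(linear) / n, sum(quadratic) / n      # mean over all receivers


def moving_average(values, window=20):
    """Average of the last `window` values (fewer at the start of the run)."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


# Illustrative data: one perfect receiver and one degraded receiver.
print(deviations([[5, 5], [4, 3]]))
print(moving_average([10, 12, 9, 14], window=2))
```
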

3.1 Variation of Play-Out Delay 

Since the play-out delay together with the current RTT mainly defines the number
of available retransmission rounds, it influences the success of several error
correction schemes. Therefore, we first investigated how the deviation
of the service value depends on the play-out delay. In our test topology the RTT
is either 60 msec or 100 msec. The NAK suppression delay varies between 2 and
2.5 times the estimated distance to the sender (1/2 RTT). We
compared the original SHARQFEC mechanism with our scheme. Figure 3 shows the
relative service quality deviation. All graphs have nearly the same
shape, which can be explained by the number of retransmission rounds that can be
used. If the play-out delay is very small (below 400 msec) there are only one
or two retransmission rounds available for the receivers with small RTTs and
even none for the others. Therefore, the deviation strongly decreases with higher
play-out delays. If the play-out delay grows beyond 400 msec the deviation
is further reduced since more retransmission rounds are usable. Especially
if only a small play-out delay (and thus only a few or no retransmission
rounds at all) is available, our scheme leads to a smaller deviation and thus to a
better service quality. The amount of proactive redundancy is adjusted in such a
way that more losses can be recovered with the available retransmission rounds.
As our scheme adjusts redundancy only if the service policy decides so, the deviation
is slightly higher with long play-out delays (more than 650 msec): since
only a few or even no adjustments have been made, only a very small amount of
proactive redundancy, or none at all, is added. This is because the majority of losses
can be repaired completely by retransmission alone. Nevertheless, by modifying
the tolerated deviation from the maximum value (which results in other service
policies) the deviation of the service quality can be reduced even for long play-out
delays. For all other simulations the play-out delay was set to 550 msec to
allow some retransmission rounds (at least two for the receivers with high RTTs)
if not otherwise stated.

Fig. 3. Service quality deviation depending on the available play-out delay

3.2 Variation of Loss 

The next property investigated was the dependency of the service quality devia- 
tion on a varying packet loss rate. For this purpose, we modified the loss rates at 
the links of our test topology between 0.01 and 0.2 (the loss rate was the same 
at all links). Figure 4 shows again a comparison of the original SHARQFEC 
mechanism with our scheme, which adjusts the redundancy to all retransmission 
rounds. Our scheme performs better than SHARQFEC for all investigated loss 
rates because the amount of proactive redundancy is adjusted in such a way that 
more losses can be recovered with the available retransmission rounds. 

3.3 Error Control Mechanisms 

In this section we compare several schemes to adjust proactive redundancy. As
a reference we took SHARQFEC, which changes only the amount of proactive
redundancy for the initial transmission ($h_0$), depending only on the zone loss
count of the previously sent blocks (figure 5: we applied SHARQFEC without
scoping and injection of redundancy by receivers). Although the amount
of proactive redundancy varies between 2 and 5 packets (with a mean value of
slightly more than 3 packets), the service quality deviation can be only partly
reduced. This is because proactive redundancy mainly depends on the short-term
measure zone loss count and is therefore not sufficient for blocks with higher
losses if the previously sent blocks suffered only smaller losses. Consequently, a
long-term adjustment should be applied.

Therefore, we evaluated a scheme with
an alteration of $h_0$. It is similar to the idea applied by Rubenstein ([10]). The
mechanism calculates the probability of success for every possible number of
proactive redundancy packets and chooses the solution with a minimum sufficient
probability (figure 6). The results for that scheme are much better because
already in the first adjustment $h_0$ was set to 3, which leads to a much shorter period
of service quality degradation. In the next two adjustments the amount of
proactive redundancy is increased stepwise, which results in a service quality
deviation close to zero.

However, it should be possible to reduce $h_0$ somewhat by
adding some proactive redundancy to the retransmission rounds. Therefore, we
tested our scheme with an adjustment of all $h_i$ values (figure 7). Our scheme
reaches a similar service quality deviation compared with the sole adjustment
of $h_0$. Furthermore, fewer proactive redundancy packets were adjusted for the
initial transmission. On the one hand, this results in fewer packets sent in the
initial transmission, but on the other hand, the amount of packets issued during
the retransmissions can be larger (this is shown in figure 7 if one compares the
smallest and the highest values of the issued packets).

To avoid the degradation
of the service value until a suitable amount of proactive redundancy is adjusted,
we combined our scheme with SHARQFEC, which changes $h_0$ depending only
on a short-term measure (figure 8: in the figure only the amount of proactive
redundancy adjusted by our scheme can be seen). The gain of such a combined
mechanism is significant because the deviation of the service quality is the lowest
of all investigated schemes for the adjustment of proactive redundancy. This is
because the short-term adjustment of SHARQFEC avoids strong degradations of
the service quality (especially at the beginning of the session or if the packet loss
rate changes strongly) while our scheme adjusts a certain amount of proactive
redundancy which realizes a constant service quality.

Fig. 4. Service quality deviation depending on the packet loss rate

Fig. 5. Service quality deviation and issued packets of SHARQFEC

Fig. 6. Service quality deviation and issued packets of a mechanism which optimizes $h_0$

Fig. 7. Service quality deviation and issued packets of a mechanism which optimizes all $h_i$

Fig. 8. Service quality deviation and issued packets of a combination of SHARQFEC and optimization



4 Conclusions 

For error control of multicast transmissions recent work showed that a combination
of FEC and ARQ methods is most effective. The new aspect of this
work is that additional proactive redundancy is also distributed over the
retransmission rounds. This makes it possible to transmit with a higher probability
of success and thus further reduce loss and delay. A new measure - the service
value - was introduced to offer a possibility to evaluate the influence of losses 
on real-time streams depending on the bandwidth, the loss distribution and the 
transmitted media characteristics. Using this new measure a set of equations was 
generated which is able to adapt the error control strategy to the service value. 
A single equation is built from this set and then applied by the scheme for the 
adjustment and the optimization of the amount and distribution of the proac- 
tive redundancy. Combined with SHARQFEC our scheme leads to a significant 
increase of the service quality since the proactive redundancy is influenced in 
two ways: The short-term adjustment of SHARQFEC avoids significant degra- 
dations of the service quality due to changing network conditions. The long-term 
adjustment of our scheme allows an adaptation of the error control to the specific 
network conditions and usage scenarios.


Abstract. Architecting networks capable of providing scalable,
efficient, and fair services to users with diverse QoS requirements is
a pressing problem. The two principal issues are: design of “good” 
per-hop behavior and edge control. In previous work [2,3], we studied 
aggregate-flow QoS control from a noncooperative resource provisioning 
context. In [20], the framework was generalized by, one, solving an 
optimal aggregate-flow per-hop behavior problem, and two, showing 
how it can be used coupled with end-to-end label control to facilitate 
scalable and fair QoS when driven by selfish users and service providers. 
In this paper, we focus on optimal aggregate-flow per-hop control 
and complement analysis by experimental performance evaluation. We 
show that user-specified, diverse QoS is efficiently facilitated over the
optimal per-hop behavior network substrate using adaptive label control 
end-to-end. 

Keywords: Differentiated services, optimal per-hop behavior, adaptive 
label control, class-based IP switching 



1 Introduction 

Architecting networks capable of providing scalable, efficient, and fair services 
to users with diverse QoS requirements is a challenging problem. The tradi- 
tional approach uses resource reservation and admission control to provide both 
guarantees and graded services to application traffic flows. Analytical tools for 
computing and provisioning QoS guarantees [7,8,17] rely on overprovisioning 
coupled with traffic shaping/policing to preserve well-behavedness properties 
across switches that implement a form of GPS packet scheduling. For applications
needing guaranteed services, the unconditional protection afforded by
per-flow resource reservation and admission control is a necessity. For the population
of elastic applications that require QoS-sensitive services but not guarantees,
it would be overkill to provision QoS using the mechanisms of per-flow
reservation and admission control. In addition to the service mismatch, overhead
associated with administering resource reservation and admission control, which
require per-flow state at routers, impedes scalability.

Recently, efforts have been directed at designing network architectures with 
the aim of delivering QoS-sensitive services by introducing weaker forms of pro- 
tection or assurance to achieve scalability [2,6,9,14]. The differentiated services 
framework [1,6,12] has advanced a set of building blocks comprised of per-hop 
and access point behaviors with the aim of facilitating scalable services through 
aggregate-flow resource control inside the network and per-flow traffic control 
at the edge. By performing a many-to-one mapping from the large space of in- 
dividual flows to the much smaller space of aggregate flow labels, scalability 
of per-hop control is achieved at the expense of introducing uncertainty and 
volatility by flow-aggregation and aggregate-flow packet switching per-hop. Sim- 
plified models of Assured Service [12] and Premium Service [13] were presented 
in [15] and analyzed with respect to their performance when compared with 
simulations. In [10], an adaptive 1-bit marking scheme is described, and the 
resulting bandwidth sharing behavior demonstrated via simulations when the 
priority level is controlled end-to-end. In [9], the authors describe a proportional 
differentiation model which seeks to achieve robust, configurable service class 
separation — i.e., QoS differentiation — with the support of two candidate packet 
schedulers. They use simulation to study the behavioral properties. In previous 
work [2,3,18], we studied aggregate-flow per-hop control mechanisms and end- 
to-end controls motivated by game theoretic considerations — a router performs 
class-based label switching which emulates user optimal service class selection 
with respect to selfishness — without considering the space of all aggregate-flow 
per-hop controls. The generalization to optimal per-hop behavior design was 
carried out in [20]. Stability and efficiency of the overall system — when driven 
by selfish users and service providers — was studied in the game theoretic con- 
text. Other related works include [14,16]. In this paper, we focus on optimal 
aggregate-flow per-hop control and its properties. We complement the theoretical
development by simulation-based performance evaluation of end-to-end QoS 
provisioning over the optimal aggregate-flow classifier substrate. We show that 
scalable, user-specified diverse QoS is effectively facilitated through the joint 
action of end-to-end adaptive label control and optimal aggregate-flow per-hop 
control. 

The rest of the paper is organized as follows. In the next section, we give a 
description of the QoS provisioning architecture. In Section 3, we present the 
optimal aggregate-flow classifier and its properties. Section 4 shows performance 
results of its QoS provisioning properties. 

2 Network Architecture 

2.1 Overall System Structure 

Assume there are $n$ flows or users. A user $i \in [1, n]$ sends a traffic stream
at average rate $\lambda_i > 0$ (bps). We will assume that $\lambda_i$ is given and
fixed (“fixed bandwidth demand”). The case when $\lambda_i$ is variable (“variable
bandwidth demand”) is considered separately. It is also possible to interpret
$\lambda_i$ as data volume (bytes), for example, file size when files are accessed
by reliable transport protocols. Let $x^i = (x^i_1, x^i_2, \ldots, x^i_s)$ denote
the vector of end-to-end QoS rendered to user $i$. For example, $x^i_1$ may
represent mean delay, $x^i_2$ packet loss rate, $x^i_3$ delay jitter (e.g., as
measured by some second-order statistic), and so forth. We assume that all QoS
measures are represented such that a smaller magnitude means better QoS. Each
packet belonging to user $i$ is inscribed with a scalar
$\eta_i \in \{1, 2, \ldots, L\}$ taking on $L$ distinct values. Typically, the
number of users is very large vis-à-vis the range of $\eta_i$, i.e., $n \gg L$,
and per-flow identity, as conveyed by $\eta_i$, is lost as soon as a packet
enters the network.

2.2 Per-hop Control 

Per-hop control consists of a classifier and a packet scheduler. We assume a GPS
packet scheduler with $m$ service classes and service weights $\alpha_k \ge 0$,
$\sum_{k=1}^{m} \alpha_k = 1$, for an output port whose link bandwidth $\mu$ is
shared in accordance with the service weights. It is not necessary to have GPS
as the underlying packet scheduling discipline (e.g., priority queues or
multiple copies of RED with different thresholds are alternatives), but we will
show that GPS has certain desirable properties when considering the problem of
selecting an optimal aggregate-flow per-hop control for differentiated services.
The most important component, where decisions about differentiated treatment of
packets inscribed with different $\eta$ values occur, is the classifier, which
is given by a map $\xi : [1, L] \to [1, m]$. That is, $n$ flows (effectively $L$
or fewer flows from the router’s perspective, since packets are scheduled by
their label values only) routed to the same output port on a switch are mapped
to $m$ service classes. For aggregate-flow control, $n \gg L$ and $L \ge m$.
If $L > m$, this leads to a further aggregation. For some choice of classifier
and packet scheduler, the QoS received by flow $i \in [1, n]$ at a switch is
determined, explicitly or implicitly, by a performance function
$x^i = x^i(\eta, \lambda)$, where $\eta = (\eta_1, \ldots, \eta_n)$ and
$\lambda = (\lambda_1, \ldots, \lambda_n)$. More precisely, flow $i$’s
performance, in the aggregate-flow case, is determined by the performance
function $x^k(\eta^a, \lambda^a)$ associated with service class $k \in [1, m]$
where

$k = \xi(\eta_i)$, $\eta^a = (1, 2, \ldots, L)$, $\lambda^a = (\lambda^a_1, \lambda^a_2, \ldots, \lambda^a_L)$, and $\lambda^a_\ell = \sum_{j : \eta_j = \ell} \lambda_j$.

That is, the switch sees only (up to) $L$ “super users” or aggregate flows. The
per-hop control structure is depicted in Figure 2.1.

Fig. 2.1. $\eta$ value in DS field of IP datagram is used by the classifier to
select service class in GPS packet scheduler
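The aggregation described above can be illustrated with a small sketch. It is not taken from the paper: the function names `aggregate_by_label` and `class_rates`, the example rates, and the dictionary form of the classifier are assumptions made for illustration only.

```python
def aggregate_by_label(labels, rates, num_labels):
    """Sum per-flow rates into per-label aggregate rates (index 0 holds label 1)."""
    agg = [0.0] * num_labels
    for eta, lam in zip(labels, rates):
        if not 1 <= eta <= num_labels:
            raise ValueError("label out of range: %r" % (eta,))
        agg[eta - 1] += lam
    return agg


def class_rates(agg, classifier, num_classes):
    """Sum aggregate rates over the labels that the classifier maps to each class.

    `classifier` is a hypothetical dict from label (1..L) to class (1..m).
    """
    out = [0.0] * num_classes
    for label, lam in enumerate(agg, start=1):
        out[classifier[label] - 1] += lam
    return out


# Five flows, L = 4 labels, m = 2 service classes (illustrative values).
labels = [1, 3, 3, 2, 1]
rates = [1.0, 2.0, 0.5, 4.0, 1.5]
agg = aggregate_by_label(labels, rates, 4)
xi = {1: 1, 2: 1, 3: 2, 4: 2}
per_class = class_rates(agg, xi, 2)
print(agg, per_class)
```

The router in this sketch never touches per-flow state after the first step: everything downstream depends only on the L aggregate rates and the label-to-class map.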

There are three properties of the per-hop control which are relevant from a
QoS control perspective. Let $e_i = (0, \ldots, 0, 1, 0, \ldots, 0)$ denote the
unit vector whose $i$’th ($i \in [1, n]$) component is 1 and 0 for all other
components. The properties are:

(A1) for each flow $i$ and configuration $\eta$, $x^i(\eta + e_i) \le x^i(\eta)$ and $x^i(\eta - e_i) \ge x^i(\eta)$;

(A2) for any two flows $i \neq j$ and configuration $\eta$, $x^j(\eta + e_i) \ge x^j(\eta)$ and $x^j(\eta - e_i) \le x^j(\eta)$;

(B) for two flows $i \neq j$ and configuration $\eta$, $\eta_i \ge \eta_j$ implies $x^i(\eta) \le x^j(\eta)$.

In the definitions, the range of $\eta$ is such that the perturbations remain in
the $n$-dimensional lattice, i.e., $\eta + e_i, \eta - e_i \in [1, L]^n$.
Property (A1) states that, other things being equal, increasing the label value
of flow $i$ improves the QoS received by flow $i$. Property (A2) states that
increasing $\eta_i$ will not improve the QoS received by any other flow $j$.
Property (B) states that if flow $i$ has a higher $\eta$ value than flow $j$,
then the QoS it receives is superior to that of flow $j$. We call property (B)
the differentiated service property. (B) has the immediate consequence
$x^i(\eta) = x^j(\eta)$ iff $\eta_i = \eta_j$. Thus there is no absolute, a
priori QoS level attached to the $\eta_i$ values. It is the magnitude of
$\eta_i$, relative to other flows’ label values, that will determine the QoS
received by a flow $i$. Properties (A1) and (B) are desirable from an end-to-end
QoS shaping (i.e., label control) point of view. Property (A2) is imposed by
resource-boundedness.

2.3 Edge Control 

The properties exported by per-hop control, if satisfied, are not sufficient by
themselves to render end-to-end QoS commensurate with user requirements.
End-to-end (or edge) control complements per-hop control by setting the value
of $\eta$ per-flow, either statically (open-loop control) or dynamically
(closed-loop control), in accordance with user needs. We assume that the network
exercises access control at the edge such that users are not permitted to assign
$\eta$ values to their packets arbitrarily: if every user assigns the maximum
$\eta$ value $L$ to their flows, then QoS control via $\eta$ loses its meaning
(it degenerates to FIFO-based best-effort service by property (B)).

User $i$’s QoS requirement, in general, can be represented by a utility
function $U_i$ which has the form $U_i(\lambda_i, x^i, p_i)$ where $\lambda_i$
is the traffic rate, $x^i$ the end-to-end QoS received, and $p_i$ the unit price
charged by the service provider. The total cost to user $i$ is given by
$p_i \lambda_i$. We assume that $U_i$ satisfies the monotonicity properties$^1$

$\partial U_i / \partial \lambda_i \ge 0$, $\partial U_i / \partial x^i \le 0$, and $\partial U_i / \partial p_i \le 0$. (2.1)

Other things being equal, an increase in the traffic rate is favourably received
by a user, so is an improvement in QoS, but an increase in the price charged by
the service provider has a detrimental effect on user satisfaction. These are
minimal, weak requirements on the qualitative form of user utility. If $\eta$
control is allowed to be exercised by the user, then a selfish user $i$ can be
defined as performing the self-optimization
$\max_{\eta_i \in [1, L]} U_i(\lambda_i, x^i, p_i)$ where $\eta_i$ influences
user $i$’s utility $U_i$ via its effect on the QoS received $x^i$. We assume
$p_i(x^i)$ is a monotone (nonincreasing) function of $x^i$ (better QoS incurs
higher cost) which corresponds to the price function exported by the service
provider.

Under the aforementioned assumptions and per-hop control properties, symmetric
label control of the form (shown here for threshold utilities)

$\frac{d\eta_i}{dt} = (-1)^{a} \, \epsilon \, \| x^i(\eta) - \theta^i \|$ (2.2)

where $a = \mathrm{sgn}(\|\theta^i\| - \|x^i(\eta)\|)$ and $\mathrm{sgn}(\cdot)$
is 1 if its argument is nonnegative, and 0, otherwise, can be shown to be
asymptotically stable. The choice of norm is relevant (the QoS vectors, under
the usual ordering relation, form a partially ordered set) and, in general,
subtle effects with respect to QoS ordering and resource control can arise
[4,5]. We remark that adaptive label control is a simpler control problem than
congestion control since it does not suffer under the latter’s unimodal
input-output relation [19].
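A discrete-time reading of label control (2.2) may help make the sign convention concrete. The sketch below is an illustration under stated assumptions, not the authors' implementation: the QoS measure is scalar (so the norm is an absolute value), the label is treated as a continuous value clipped to [1, L], and `toy_loss` is a purely hypothetical monotone model standing in for the network's response.

```python
def label_step(eta, qos, target, gain, num_labels):
    """One Euler step of the label update for a scalar QoS measure.

    a = 1 when the target is met (|target| - |qos| >= 0), which lowers the label;
    a = 0 when the received QoS is worse than the target, which raises it.
    """
    a = 1 if abs(target) - abs(qos) >= 0 else 0
    eta_next = eta + (-1) ** a * gain * abs(qos - target)
    return min(max(eta_next, 1.0), float(num_labels))


def toy_loss(eta):
    """Hypothetical stand-in for the network: loss falls as the label rises."""
    return 1.0 / eta


def run(target, gain=0.5, num_labels=16, steps=2000):
    eta = 1.0
    for _ in range(steps):
        eta = label_step(eta, toy_loss(eta), target, gain, num_labels)
    return eta


final_eta = run(target=0.25)
print(final_eta, toy_loss(final_eta))
```

With this toy model the label settles where the received loss equals the threshold. The step size shrinks with the error, so the update slows as the target is approached.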



3 Optimal Aggregate-Flow Per-hop Control 

3.1 Semantics of Optimal Per-hop Behavior 

Consider the per-flow control or classifier problem for $n$ users who choose
packet labels from $[1, L]$. Technically, per-flow classification means $n = m$,
and $L$ is either greater or smaller than $n$. The range $L$ may be finite or
unbounded, and the variable $\eta_i \in [1, L]$ discrete or continuous. When $n$
users mark their flows with a value $\eta_i$ drawn from the metric space
$[1, L]$ (Euclidean distance) with property (A1) satisfied (larger $\eta_i$
values, other things being equal, result in a greater apportionment of resources
and thus better QoS), $\eta_i$ can be viewed as codifying a user’s QoS or
resource demand with respect to some measurement unit. If network resources are
infinite, then a flow’s request can be satisfied based on the $\eta_i$ value
specified, without consideration of the needs specified by other flows (except,
possibly, for pricing issues). That is, independence or decoupling holds. If, on
the other hand, resources are finite (an OC-24 link is shared among
bandwidth-intensive users), then the users’ collective resource demand may
exceed available bandwidth. In the presence of such resource contention, a
conflict resolution scheme is needed, including the criteria by which allocation
is decided.

$^1$ $U_i$ need not be differentiable, nor even be continuous. We use continuous
notation here for notational clarity; monotonicity is the only property
required.

Assume available bandwidth is normalized such that total available bandwidth
is $\mu = 1$. First, assume $\eta_i \in \mathbb{R}_+$ is a continuous variable
over the real unit interval $[0, 1]$, expressing user $i$’s normalized bandwidth
demand per unit flow. Let $\alpha = (\alpha_1, \ldots, \alpha_n)$ with
$\alpha_i \ge 0$, $\sum_{i=1}^{n} \alpha_i = 1$, represent the fraction of
resources apportioned by the per-flow classifier to $i \in [1, n]$, and let
$\omega_i = \alpha_i / \lambda_i$ denote the fraction of resources allocated to
$i$ per unit flow. Under the above semantics, given $\eta$ and $\lambda$, the
optimization

$\min_{\alpha} \sum_{i=1}^{n} (\eta_i - \omega_i)^2$ (3.1)



measures the “goodness” of a resource allocation $\alpha$ with respect to users’
codified needs $\eta$ in the mean-square sense. Since (3.1) penalizes by the
difference error, the relative importance of higher $\eta_i$ values is
preserved, and resources are apportioned accordingly. For general
$\eta_i \in \mathbb{R}_+$, including the discrete and bounded case
$\eta_i \in \{1, \ldots, L\}$ which is of special interest, define the
normalization

$\hat{\eta}_i = \begin{cases} (\eta_i - \eta_{\min}) / (\eta_{\max} - \eta_{\min}), & \text{if } \eta_{\max} \neq \eta_{\min}; \\ 1, & \text{otherwise,} \end{cases}$ (3.2)

where $\eta_{\min}$, $\eta_{\max}$ are the minimum and maximum values of
$\{\eta_1, \eta_2, \ldots, \eta_n\}$, respectively. Note that
$\hat{\eta}_i \in [0, 1]$, and unless all $\eta_i$ values are equal,
$\hat{\eta}_{\min} = 0$ and $\hat{\eta}_{\max} = 1$. Similarly, let
$\hat{\omega}_i$ denote the normalization of $\omega_i$. Then, given $\eta$,
the generalized optimization criterion is

$\min_{\alpha} \sum_{i=1}^{n} (\hat{\eta}_i - \hat{\omega}_i)^2$. (3.3)

(3.3) realizes the same semantics as (3.1), but generalized by the function
or “code” (not 1-1) given by (3.2) to $\eta_i$ values not restricted to the real
unit interval $[0, 1]$. If $L$ is bounded, then the 1-1 function
$\hat{\eta}_i = \eta_i / L$ achieves a similar purpose.



Proposition 3.4 Given $\eta, \lambda \in \mathbb{R}^n_+$, the solution to (3.3) is

$\alpha_i = (1 - \nu) \frac{\lambda_i \hat{\eta}_i}{\sum_{j=1}^{n} \lambda_j \hat{\eta}_j} + \nu \frac{\lambda_i}{\sum_{j=1}^{n} \lambda_j}$, $i \in [1, n]$, (3.5)

where $0 \le \nu \le 1$ is a parameter defining a continuous family of
solutions. The optimal per-flow classifier given by (3.5) satisfies properties
(A1), (A2), and (B).

The parameter $\nu$, which stems from the dimension reduction associated with
(3.2), has an appealing interpretation. The second term in (3.5) corresponds
to the proportional share achieved by FIFO scheduling, whereas the first term
corresponds to proportional share of the corresponding virtual flows
$\lambda_i \hat{\eta}_i$, which are the original flow rates weighted by their
relevancy variable $\hat{\eta}_i$ derived from $\eta_i$. Thus (3.5) represents a
convex combination of two extreme behavioral modes.
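The convex-combination structure of (3.5) can be checked numerically. The sketch below is an illustration only: the function names are hypothetical, and it reads (3.5) as a weighted sum of a label-weighted proportional share and a FIFO proportional share. For the degenerate case where all labels are equal it assumes a common constant of 1; any common positive constant yields the same weights because only ratios enter.

```python
def normalize(labels):
    """Min-max normalization of label values in the spirit of (3.2)."""
    lo, hi = min(labels), max(labels)
    if hi == lo:
        # Assumption: a common constant; the resulting weights do not depend on it.
        return [1.0] * len(labels)
    return [(e - lo) / (hi - lo) for e in labels]


def optimal_weights(labels, rates, nu):
    """Service weights as a convex combination of two proportional shares."""
    if not 0.0 <= nu <= 1.0:
        raise ValueError("nu must lie in [0, 1]")
    eta_hat = normalize(labels)
    virtual = [lam * e for lam, e in zip(rates, eta_hat)]  # label-weighted rates
    total_virtual, total_rate = sum(virtual), sum(rates)
    return [(1 - nu) * v / total_virtual + nu * lam / total_rate
            for v, lam in zip(virtual, rates)]


w = optimal_weights([1, 2, 3], [1.0, 1.0, 1.0], nu=0.5)
fifo = optimal_weights([2, 2], [1.0, 3.0], nu=0.3)
print(w, fifo)
```

In the first call, equal rates with increasing labels give strictly increasing weights that sum to one. In the second, equal labels collapse the allocation to the FIFO share, which matches the remark in Section 2.3 that label control loses its meaning when every user picks the same label.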



Efficient Shaping of User-Specified QoS Using Aggregate-Flow Control



3.2 Optimal Aggregate-Flow Classification 

Let us consider the aggregate-flow classifier problem where $n > m$.
Aggregate-flow control, whether it has many or few labels, must service $n$
flows using $m < n$ service classes which results, in general, in a reduced
ability to effectively shape end-to-end QoS with respect to the performance
criterion (3.3) when compared to per-flow control. That is, the minimum value of
(3.3) achieved by optimal per-flow control is smaller than that of optimal
aggregate-flow control.

An aggregate-flow per-hop control with parameter $(m, L)$ is a function

$\Phi_{m,L} : (\eta, \lambda) \mapsto (\xi, \alpha)$ (3.6)

where $\xi : [1, L] \to [1, m]$ is the classifier and
$\alpha = (\alpha_1, \ldots, \alpha_m)$ is the vector of service weights
assigned to the $m$ service classes. With respect to end users, $\Phi_{m,L}$
induces, explicitly or implicitly, a performance function $\varphi^i$ for each
user $i \in [1, n]$, $\varphi^i : (\eta, \lambda) \mapsto \alpha_i$, where
$\alpha_i = \varphi^i(\eta, \lambda) \ge 0$ is user $i$’s share of the bandwidth
allocated by $\Phi_{m,L}$. With a slight abuse of notation, we use $\alpha_i$
to denote both user $i$’s ($i \in [1, n]$) apportioned resource, as well as the
service weight allocated by $\Phi_{m,L}$ to service class $i$ ($i \in [1, m]$).
In the per-flow case, they coincide.

Consider a special type of aggregate-flow per-hop control $\Phi_{m,L}$, called
Reduction Classifier, whose behavior is completely determined by its classifier
$\xi : [1, L] \to [1, m]$, in the following sense. Let
$S_k = \{i \in [1, L] : \xi(i) = k\}$, $k \in [1, m]$, denote the partition of
$[1, L]$ induced by $\xi$. Then $\Phi_{m,L}$ is specified by the following
procedure:

$\Phi_{m,L}(\eta, \lambda)$:

1. Compute $\lambda^k = \sum_{i \in S_k} \lambda_i$ for each $k \in [1, m]$.

2. Compute $\hat{\eta}^k$ for $k \in [1, m]$ as follows.

$\hat{\eta}^k = \begin{cases} 0, & \text{if } \exists i \in S_k,\ \hat{\eta}_i = 0; \\ 1, & \text{if } \exists i \in S_k,\ \hat{\eta}_i = 1; \\ \sum_{i \in S_k} \hat{\eta}_i / |S_k|, & \text{otherwise.} \end{cases}$

3. Use per-flow optimal solution (Proposition 3.4) with new input
$\hat{\eta} = (\hat{\eta}^1, \ldots, \hat{\eta}^m)$,
$\lambda = (\lambda^1, \ldots, \lambda^m)$, to solve the reduced per-flow
classifier problem consisting of $m$ superusers.

A reduction classifier reduces the $L$ label (or $n$ user) problem to an $m$
user per-flow classification problem by aggregation of component flows and
centroid computation, then solves the reduced problem by applying the optimal
per-flow classification solution.

Theorem 3.7 Let $\Phi_{m,L}$ be a reduction classifier represented by its
classifier $\xi$. Then $\Phi_{m,L}$ is an optimal aggregate-flow per-hop
control, i.e., satisfies (3.3) if, and only if, $\xi$ is a solution to

$\min_{\xi} \sum_{k \in [1, m]} \sum_{i \in S_k} (\hat{\eta}_i - \hat{\eta}^k)^2$ (3.8)

where the minimum ranges over all reduction classifiers.

Theorem 3.7 shows that an optimal aggregate-flow classifier must efficiently
cover, in the mean-square sense, the set of label values
$\{\hat{\eta}_1, \hat{\eta}_2, \ldots, \hat{\eta}_n\}$ using $m$ centroids
$\{\hat{\eta}^1, \ldots, \hat{\eta}^m\}$. Aggregate-flow per-hop control is,
mathematically, an optimal clustering problem. Unlike its many brethren in
higher dimensions that are NP-complete [11], the clustering problem given by
(3.8) in Theorem 3.7 has a poly-time algorithm; e.g., it can be solved by
dynamic programming. When $L = m$, the practically relevant case where there are
as many labels as service classes, optimal aggregate-flow classification has a
linear time algorithm.
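The remark that the one-dimensional clustering problem admits a dynamic programming solution can be made concrete. The sketch below is a generic textbook formulation, not the paper's algorithm: it minimizes the unweighted sum of squared distances to plain cluster means, which is an assumption (the centroid rule of the reduction classifier treats the extreme labels specially). It relies on the standard fact that optimal clusters of sorted scalar values are contiguous. The function name is hypothetical.

```python
def optimal_1d_clustering(values, m):
    """Split scalar values into m contiguous clusters minimizing squared error.

    Runs in O(m * n^2) time using prefix sums; returns (cost, clusters).
    """
    xs = sorted(values)
    n = len(xs)
    if not 1 <= m <= n:
        raise ValueError("need 1 <= m <= number of values")

    # Prefix sums of x and x^2 give the squared error of any slice in O(1).
    p1, p2 = [0.0], [0.0]
    for x in xs:
        p1.append(p1[-1] + x)
        p2.append(p2[-1] + x * x)

    def cost(i, j):
        """Squared error of xs[i:j] around its mean (requires j > i)."""
        s = p1[j] - p1[i]
        return (p2[j] - p2[i]) - s * s / (j - i)

    inf = float("inf")
    best = [[inf] * (n + 1) for _ in range(m + 1)]
    cut = [[0] * (n + 1) for _ in range(m + 1)]
    best[0][0] = 0.0
    for k in range(1, m + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                c = best[k - 1][i] + cost(i, j)
                if c < best[k][j]:
                    best[k][j] = c
                    cut[k][j] = i

    # Walk the stored cut points backwards to recover the clusters.
    clusters, j = [], n
    for k in range(m, 0, -1):
        i = cut[k][j]
        clusters.append(xs[i:j])
        j = i
    clusters.reverse()
    return best[m][n], clusters


total_cost, clusters = optimal_1d_clustering([1, 2, 9, 10, 20], 3)
print(total_cost, clusters)
```

For the five illustrative values the two tight pairs and the outlier form the three clusters, each pair contributing 0.5 to the total squared error.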



3.3 Properties of Optimal Aggregate-Flow Classifier 

To satisfy user $i$’s QoS requirement, the per-hop control must apportion a
fraction $\alpha_i > 0$ of the available bandwidth. Let $\alpha_i^*$ denote the
minimal such bandwidth. We will use $\varphi^i(\cdot)$ to denote the performance
function corresponding to $x^i(\cdot)$ which allocates, explicitly or
implicitly, a service weight to user $i$ for a given input $\eta$. For user $i$,
let $\mathcal{A}_i = \{\eta : \varphi^i(\eta) \ge \alpha_i^*\}$. Thus
$\mathcal{A}_i$ represents the set of configurations where user $i$’s QoS
requirement is satisfied. Let $\mathcal{A}^* = \bigcap_{i=1}^{n} \mathcal{A}_i$.
A configuration $\eta$ is system optimal if $\eta \in \mathcal{A}^*$, and for
all $\eta' \neq \eta$, $\varphi(\eta') > \varphi(\eta)$ does not hold. In a
system optimal configuration, the users’ QoS requirements are met while
expending the minimal amount of resources.

Theorem 3.9 An optimal aggregate-flow per-hop control with parameters $L = m$
satisfies properties (A1), (A2), and (B).



The L = m constraint advanced by Theorem 3.9 coincides with practical con- 
siderations that derive from an implementation perspective. For example, as- 
suming four bits from the TOS field in IPv4 are used to encode the label set 
$\{1, 2, \ldots, 16\}$, then we may configure 16 service classes at routers, one for each 
of the 16 possible label values. 



Theorem 3.10 Let $L = m < n$. $\mathcal{A}^* \neq \emptyset$ if there exists
$\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_n)$ with $\alpha_{\min} = 0$,
$\alpha_{\max} = 1$, $0 \le \alpha_i \le 1$, such that for all $i \in [1, n]$,

$(1 - \nu) \frac{\lambda_i \alpha_i}{\sum_{j=1}^{n} \lambda_j \alpha_j} + \nu \frac{\lambda_i}{\sum_{j=1}^{n} \lambda_j} \ge \alpha_i^* + \frac{1 - \nu}{L - 1} \cdot \frac{\lambda_i}{\sum_{j=1}^{n} \lambda_j \alpha_j}$. (3.11)

The left-hand side of inequality (3.11) denotes a valid service weight vector
with respect to the optimal aggregate-flow classifier. The second term on the
right-hand side of (3.11) quantifies the loss of power due to coarsification.
If $L \to \infty$, then the term drops out. In practice, $L$ is a small finite
value (e.g., using 4 bits in the precedence field of IP, $L = 16$). The next
result shows that when $n \gg L$ (the raison d’être of aggregate-flow control)
relation (3.11) can be tight even for small $L$.



Corollary 3.12 Under the same conditions as Theorem 3.10, let
$d_i = \lfloor (L - 1) \alpha_i \rfloor$, $i \in [1, n]$. Then,
$\mathcal{A}^* \neq \emptyset$ if for all $i \in [1, n]$,

$(1 - \nu) \frac{\lambda_i \alpha_i}{\sum_{j=1}^{n} \lambda_j \alpha_j} + \nu \frac{\lambda_i}{\sum_{j=1}^{n} \lambda_j} \ge \alpha_i^* + (1 - \nu) \frac{\lambda_i}{\sum_{k=1}^{L-1} k \sum_{j : d_j = k} \lambda_j}$. (3.13)

For $n \gg L$, we can expect
$\lambda_i / \sum_{k=1}^{L-1} k \sum_{j : d_j = k} \lambda_j \ll 1$, and (3.11)
gives a tight bound on the existence condition of system optimal configurations.
We remark that the normalization $\hat{\eta}_i = \eta_i / L$ is able to achieve
a higher efficiency with respect to $\mathcal{A}^*$. The other results hold as
well. The optimal efficiency problem, and its relation to the present semantics
of optimal aggregate-flow per-hop control, is studied in a separate paper.
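The relation between the second right-hand-side terms of (3.11) and (3.13) can be traced in two lines. The derivation below is a sketch under assumptions: it takes the coarsification term of (3.11) to be $\frac{1-\nu}{L-1}\,\lambda_i / \sum_j \lambda_j \alpha_j$, takes $d_j = \lfloor (L-1)\alpha_j \rfloor$ as in Corollary 3.12, and assumes at least one $d_j$ is positive so that the sums are nonzero.

```latex
\begin{align*}
% d_j is a floor, so d_j <= (L-1) alpha_j; group flows by their value of d_j.
\sum_{j=1}^{n} \lambda_j \alpha_j
  &\ge \frac{1}{L-1} \sum_{j=1}^{n} \lambda_j d_j
   = \frac{1}{L-1} \sum_{k=1}^{L-1} k \sum_{j : d_j = k} \lambda_j , \\
% Taking reciprocals reverses the inequality.
\frac{1-\nu}{L-1} \cdot \frac{\lambda_i}{\sum_{j=1}^{n} \lambda_j \alpha_j}
  &\le (1-\nu)\, \frac{\lambda_i}{\sum_{k=1}^{L-1} k \sum_{j : d_j = k} \lambda_j} .
\end{align*}
```

Under these assumptions the penalty term of (3.13) dominates that of (3.11), so a vector that meets the corollary's condition also meets the theorem's when the left-hand sides coincide. When $n \gg L$ the denominator sums over many flows, which is why the term is expected to be small.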



4 Performance Results 

4.1 Set-up 

We study the dynamical properties of scalar QoS control using simulation when
the $\eta$ values are controlled by selfish users at the edge. We use the LBNL
Network Simulator ns (version 2) as our simulation platform. We have modified ns
to incorporate various per-hop and end-to-end control mechanisms. QSim [5] is
an ns-based WAN QoS simulation environment used in the performance evaluation
experiments. We have benchmarked a number of network topologies including vBNS.
Figure 4.1 (left) shows a 4-hop topology that emulates the physical topology of
Purdue Infobahn, a private IP-over-SONET QoS testbed comprised of Cisco 7206 VXR
routers shown in Figure 4.1 (middle) and (right). Implementation of optimal
per-hop behavior in IOS with cooperation from Cisco is work in progress.

4.2 Structural Properties 

Figure 4.2 demonstrates the properties (A1), (A2), and (B) satisfied by the
optimal aggregate-flow per-hop control which are critical to the facilitation of
effective QoS provisioning. The left figure shows packet loss rate across
$m = 16$ service classes for fixed label values $\eta = 1, 2, \ldots, 16$, for
varying contention levels. We observe that service class separation, property
(B), is satisfied across different contention levels. Figure 4.2 (middle) shows
the effect of varying $\eta$ for a fixed user (here flow 16) on the QoS received
by all other users (i.e., groups of users over 16 service classes). That is, the
graphs demonstrate $\partial x^j / \partial \eta_{16}$ for
$j = 1, 2, \ldots, 16$. We observe that properties (A1) and (A2) hold for the
optimal aggregate-flow per-hop control. Figure 4.2 (right) shows QoS separation
across 16 service classes for delay. In general, for properties (A1) and (A2) to
hold for delay, GPS must be augmented with buffer control.

Huan Ren and Kihong Park

Fig. 4.1. Left: Flow configuration of a 4-hop benchmark topology. Middle:
Logical layout of Purdue Infobahn QoS testbed. Right: Physical layout of Purdue
Infobahn IP-over-SONET backbone comprised of four Cisco 7206 VXR routers




Fig. 4.2. Manifestation of properties (A1), (A2), and (B). Left: Service class
QoS separation (packet loss rate) as a function of bottleneck bandwidth
(property (B)). Middle: End-to-end QoS as a function of $\eta_{16}$, which is
varied from 1 to 16 (properties (A1) and (A2)). Right: QoS separation for delay



4.3 Adaptive Label Control 

Dynamics Figure 4.3 shows the QoS shaping dynamics of an $n = 25$ user,
$L = 16$ label, and $m = 16$ service class configuration for a 2-hop
configuration with 7 groups of users with packet loss rate requirements 0.1
(users 0, 1), 0.15 (users 2, 3, 4), 0.2 (users 5, 6, 7), 0.3 (users 8, 9, 10),
0.4 (users 11-14), 0.6 (users 15-18), and best-effort (users 19-24). The
latter’s $\eta$ value is fixed at 1. Figure 4.3 (top) shows the packet loss
trace for users 0, 5, and 11-14 whose QoS requirements are satisfied. The bottom
plots show the corresponding $\eta$ traces. We observe that adaptive label
control is able to achieve the target QoS in a distributed and end-to-end manner
over the optimal aggregate-flow per-hop control network substrate. In
particular, Figure 4.3 (right) shows the $\eta$ value trace of users 11, 12, 13,
and 14 who happen to possess the same QoS requirement (packet loss rate 0.4). We
observe that they converge to the same service class as determined by $\eta$.
Similar results hold for end-to-end delay.




Fig. 4.3. Individual flow dynamics. Top: Packet loss trace for users with QoS 
requirements 0.1, 0.2, and 0.4. Bottom: Corresponding $\eta$ label value trace 



End-to-End QoS Provisioning Figure 4.4 (left) shows QoS achieved using
adaptive label control over the aggregate-flow per-hop behavior substrate for
$n = 25$, $L = m = 16$, and user QoS requirements 0.01, 0.05, 0.1, 0.15, 0.2,
0.3, and 0.4. For small bandwidth, $\mathcal{A}^*$ is empty and no solution
exists that satisfies all user requirements. As bandwidth is increased, adaptive
label control achieves both QoS differentiation and user-specified QoS shaping.
Figure 4.4 (right) shows a 10-fold larger system comprised of $n = 250$ users,
$L = m = 16$, and seven levels of QoS requirements. 200 users, spanning all
seven QoS requirement levels, are VBR sources (Poisson) and 50 users (also
spanning all seven levels) are CBR sources. We observe that when resources are
sufficient and $\mathcal{A}^* \neq \emptyset$, adaptive label control, in
conjunction with optimal per-hop behavior, is able to differentiate as well as
effect end-to-end QoS commensurate with user requirements.

Abstract. We propose a heuristic algorithm for the construction of a 
heterogeneous multicast distribution tree used for video transmission 
that satisfies different QoS requests. Our approach assumes an active 
network where active nodes can filter the video stream. By appropriately 
choosing the active nodes at which filtering is performed, the proposed 
algorithm constructs the multicast tree where, for example, the total re- 
quired bandwidth is minimized. We assume existing multicast routing 
algorithms to form multicast groups, and the resulting distribution tree 
is a hierarchical arrangement of those groups. 

We evaluate the performance of our algorithm and compare it against 
two other approaches: simulcast and layered encoded transmission. Results
show that there are advantages when network nodes participate in the
construction of the heterogeneous multicast distribution tree, such as the
possibility of setting up a larger number of simultaneous multicast sessions.

Keywords: heterogeneous multicast, active networking, video filtering. 

1 Introduction 

The problem of multicast distribution of video becomes even more complex in a
heterogeneous environment, where different clients joining the same multicast
session have different quality requests due to limitations in the network
bandwidth or in the end hosts’ processing capabilities. We must deal with this
heterogeneity by processing the original video stream so that several streams of
different quality and rate can be provided, and by using a distribution method
that gives each client the stream with the quality closest to its requirement.

In simulcast, the server produces a different video stream for each requested
quality. This is the easiest way of meeting multiple QoS requirements
simultaneously, but it wastes system resources. Our research group has proposed
the use of flow aggregation [3], where clients with similar QoS requests are
aggregated to minimize the number of streams produced at the sender.



J. Crowcroft, J. Roberts, and M. Smirnov (Eds.): QofIS 2000, LNCS 1922, pp. 272-284, 2000. 
© Springer-Verlag Berlin Heidelberg 2000 



On the Construction of Heterogeneous Multicast Distribution Trees



However, this remains a simulcast transmission, and the server and the
network would be overloaded when the number of clients is large.

Layered encoded video has also been considered to address the problem of 
heterogeneity. In layered video, a video stream is decomposed into a base layer 
stream and some enhancement layers. The base layer is enough for decoding the 
video sequence but in the lowest quality, and the reception of additional layers 
is necessary to decode higher quality video. Each client chooses the appropriate 
set of layers to achieve the preferred video quality. McCanne et al. proposed [8] a 
framework for the transmission of layered signals using IP multicast. The advan- 
tage of this approach is that it can be used in the current network infrastructure. 
Problems with layered video include the difficulty in generating the video
layers, the required bandwidth overhead, and the granularity: since only a few
layers can be produced, only a few different qualities are available. We also
considered the use of flow aggregation with layered coded video [14] and showed
that it achieves more effective video multicast. However, this approach still
retains the limitations of layered encoding, although we introduced a way of
constructing video streams of 12 layers.

To cope with heterogeneity, the research community has become interested in
the possibility of introducing limited programmability inside the network,
allowing network nodes to decide whether to modify the contents and the routing
of data packets after performing some computation. These processing-enabled
nodes are called active nodes [11, 13]. Expectations for active networking are
wide-ranging, from accelerating the introduction of new protocols to introducing
novel services at the node level. An example is the introduction of new
approaches for multicast [1, 7, 9] that exploit the advantages of intermediate
node processing.

Heterogeneous multicasting of video is a natural candidate for enjoying the 
benefits of active networking. The server becomes free from providing video 
streams of different qualities if the quality regulation can be performed inside 
the network. At video filtering nodes, new video streams of lower qualities can 
be derived from the received ones at the expense of processing capability. The 
methods for reducing the quality of a video signal depend on the encoding format 
used. 

Research into filtering by Yeadon et al. [16] and Pasquale et al. [10] predates
active networking research, but proposes a filtering propagation mechanism to
vary the location where filtering occurs according to the requirements of
downstream clients. AMnet [9] proposes a model and an implementation for
providing heterogeneous multicast services using active networking. According to
this approach, a hierarchy of multicast groups is formed, in which some active
nodes that act as receivers in a multicast group become roots in other multicast
groups, but it is not explained how the multicast groups are formed and how the
root senders of each multicast group are elected.

In this work we address one aspect of heterogeneous multicasting using active
node filtering: how to decide at which active nodes filtering must be performed
to achieve an efficient multicast distribution tree. We propose an algorithm
that forms a hierarchy of multicast groups, where the top level group root is
the video server. The members of this group are the clients which requested the
highest quality video, and one or more active nodes which filter the video
stream, producing one with lower quality. These active nodes become roots of
other multicast groups to satisfy the requirements of other clients.
Analogously, these new multicast groups can have one or more active nodes as
members that become roots of even lower level groups.
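The hierarchy of multicast groups described above can be pictured as a simple recursive structure. The sketch below is only an illustration: the class name `MulticastGroup`, its fields, the node names, and the numeric quality values are all hypothetical and not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class MulticastGroup:
    """One group in the hierarchy: a root sending one quality to its members."""
    root: str                                      # server or active node
    quality: int                                   # quality of the stream sent
    clients: list = field(default_factory=list)    # receivers served directly
    subgroups: list = field(default_factory=list)  # groups rooted at member active nodes


def filtering_nodes(group):
    """List the active nodes that filter, walking the hierarchy top down."""
    nodes = []
    for sub in group.subgroups:
        if sub.quality >= group.quality:
            raise ValueError("a filtered stream must have lower quality")
        nodes.append(sub.root)
        nodes.extend(filtering_nodes(sub))
    return nodes


# Server sends quality 10; active node A filters to 6; active node B filters to 3.
lowest = MulticastGroup(root="B", quality=3, clients=["c3"])
middle = MulticastGroup(root="A", quality=6, clients=["c2"], subgroups=[lowest])
top = MulticastGroup(root="server", quality=10, clients=["c1"], subgroups=[middle])
print(filtering_nodes(top))
```

Each active node appears twice in this picture: as a receiver in its parent group and as the root of its own group, which is what lets quality decrease monotonically down the hierarchy.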

We assume that filtering functions are provided within active nodes. How the 
filtering mechanisms are implemented is out of the scope of this work. 

This paper is organized as follows: Section 2 explains our algorithm in detail; 
Section 3 evaluates its performance, comparing it with other approaches for 
distributing video; Section 4 concludes our work. 

2 Construction of the Multicast Distribution Tree 

In this section we detail our approach for the construction of the multicast 
distribution tree. We are assuming the following: 

1. The server collects all of the clients’ requests and builds an appropriate
multicast distribution tree prior to the start of the video transmission.
We assume the server can get all the information it requires, such as the
network topology and information about active nodes. The drawback of this
centralized approach is lack of scalability: the proposed algorithm scales
poorly with large topologies or large numbers of clients. We can alleviate this
issue to some extent by using clustering to group requests from closely located
clients with similar quality requests.

2. For simplicity, we assume one QoS dimension, and therefore each client
request can be expressed by a numerical value that denotes the requested
quality.

3. We can re-filter an already filtered video sequence in order to obtain another 
one with lower quality. We are not taking into consideration the effect that 
delays due to filtering can cause to the perceived quality. 

4. We do not replace existing multicast routing algorithms; indeed, we assume
they are provided by the network layer. Our goal is the formation of the
multicast groups, i.e., the election of the roots and members of each of them.
DVMRP and MOSPF [12] are widely used multicast routing algorithms in 
dense mode subnetworks that build spanning trees rooted at the source using 
shortest path techniques. Therefore, we assume that the paths the multicast 
routing algorithm uses are the same as the unicast paths from the source to 
each one of the destinations, as created by Dijkstra’s algorithm. 
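
The shortest-path assumption in item 4 can be illustrated with a small sketch. It is not part of the original work: the topology, the node names and the helper names `shortest_path_tree` and `path_from_source` are hypothetical, and unit link costs are assumed so that path cost equals hop count.

```python
import heapq

def shortest_path_tree(graph, source):
    """Dijkstra's algorithm: return {node: parent} for shortest paths from source.

    graph maps each node to a dict {neighbor: link_cost}.
    """
    dist = {source: 0}
    parent = {source: None}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale queue entry, a shorter path was already found
        for v, cost in graph[u].items():
            nd = d + cost
            if v not in dist or nd < dist[v]:
                dist[v] = nd
                parent[v] = u
                heapq.heappush(heap, (nd, v))
    return parent

def path_from_source(parent, destination):
    """Follow parent pointers back to the source; return the path source-first."""
    path = []
    node = destination
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]

# Hypothetical 5-node topology with unit link costs (hop count).
example_graph = {
    "s": {"a": 1, "b": 1},
    "a": {"s": 1, "c": 1},
    "b": {"s": 1, "d": 1},
    "c": {"a": 1, "d": 1},
    "d": {"b": 1, "c": 1},
}
example_parent = shortest_path_tree(example_graph, "s")
```

Under this assumption, the links a multicast stream occupies toward a set of members are simply the union of the per-member paths returned by `path_from_source`.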

2.1 Sketch of the Active Application to Handle Filtering 

Because the tree is computed centrally at the server, the required active
application can be kept simple. It would involve the following steps:

1. Service Announcement. The server uses a well-known multicast address to
inform the possible clients about the session contents. The protocol used
to send these messages can be similar to the SAP protocol used in the
MBONE [5].

Fig. 1. Example of a logical topology

Fig. 2. Example of a state tree

2. Subscription. Each of the clients that wants to participate in the session 
sends a request containing the desired quality. 

3. Calculation of the distribution tree. When this step is completed, we have the 
multicast groups and the nodes that are going to perform filtering defined. 

4. Sending the information to the active nodes designated to perform filtering. 
The server transmits the following information to the nodes: 

- Multicast address as receiver: the active node receives the data to
be filtered through this address

- Multicast address as sender: the active node sends the filtered data using
this address

- Filtering code and parameters: the kind of filtering to apply to the stream

5. Sending the corresponding multicast group address to each client, in order 
to have them subscribe. 

6. Subscription of each client to the corresponding multicast group. 

7. Data transmission. The server transmits the required streams to the multi- 
cast groups in which it is the root. Active nodes inside the network designated 
as filters redistribute the video stream as required. 
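
The information exchanged in steps 4 and 5 can be pictured as small control records. The sketch below is only an illustration: the class name `FilterInstruction`, its field names, the filter identifier `"requantize"` and the multicast addresses are hypothetical and do not come from the original application.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class FilterInstruction:
    """Control message from the server to one designated active node (step 4)."""
    receive_group: str  # multicast address the node joins to receive the stream
    send_group: str     # multicast address the node uses for the filtered stream
    filter_name: str    # which filtering code to run (hypothetical identifier)
    parameters: dict = field(default_factory=dict)  # parameters for that filter

def client_notices(group_of_client):
    """Step 5: list (client, group address) pairs, sorted by client name."""
    return sorted(group_of_client.items())

# Hypothetical example: one active node lowers the quality of group 239.0.0.1
# and re-sends the result on group 239.0.0.2.
instruction = FilterInstruction(
    receive_group="239.0.0.1",
    send_group="239.0.0.2",
    filter_name="requantize",
    parameters={"quantizer_scale": 30},
)
notices = client_notices({"client_b": "239.0.0.2", "client_a": "239.0.0.1"})
```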

2.2 Distribution Tree Construction Algorithm 

Our algorithm forms the distribution tree on a request-by-request basis, taking
the requests in descending order of quality. When several requests have the
same quality, we first take those from the clients closer to the sender. We
try to use the sender to stream to the clients that require the highest
quality, and choose the best-placed nodes to perform filtering. Each designated
active node becomes the root of a new multicast group carrying a filtered video
stream of lower quality. The filtered stream is then sent to clients that
demanded lower quality streams. We form a hierarchy of multicast groups as
proposed in AMnet [9], as shown in Fig. 1.

Each step in the construction of the tree defines a state. A state is defined
by a variable c, the number of clients that have already been considered, and
by the characteristics of the distribution tree needed to serve those clients,
that is, which nodes, if any, have been used to filter and produce the stream
with the requested quality. Fig. 2 depicts a sample state tree. Each state
is denoted c-i, where c is the number of clients, i ≤ N_c is the state index,
and N_c is the number of states in that round. In the first round there is
only one state, 1-1, in which the single client with the highest demand is
satisfied by receiving the video stream at the required quality directly from
the server. From a state in round c it is possible to derive several states for
round c + 1, depending on how the stream that the new client demands is
generated.

When deriving states from a round in the state tree, we define a set of “can- 
didate senders” to provide the requested stream to the client newly considered 
in the next round. Either the original server of the video sequence or any of the
active nodes in the network can be a candidate sender. For a given flow request
and candidate sender, one of the following situations is possible: 

1. The candidate sender is already requested to relay a stream with the desired 
quality by a previously processed client. In this case the client subscribes to 
the multicast group the stream belongs to. 

2. The candidate sender is already requested to relay a stream with a quality 
higher than the one requested. In this case, this stream must be filtered 
at this candidate sender. Then, a new multicast group is created with the 
candidate sender as the root, and the requesting client becomes a member 
of this multicast group. 

3. The candidate sender is not relaying a flow. In this case, the candidate sender
must first subscribe to a multicast group, filter the stream that it receives
as a member of this group, and become the root of a new multicast group. The
requesting client subscribes to this new group to get the stream.
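
The three situations above amount to a simple case analysis, sketched below. The function name and the returned labels are hypothetical. One case is not spelled out in the text, namely a candidate that relays only lower-quality streams; the sketch assumes it is handled like situation 3, since such a node must also first subscribe to a higher-quality group.

```python
def classify_candidate(relayed_qualities, requested_quality):
    """Return which of the three situations applies to a candidate sender.

    relayed_qualities: quality levels the candidate already relays or produces.
    requested_quality: quality level requested by the newly considered client.
    """
    relayed = set(relayed_qualities)
    if requested_quality in relayed:
        # Situation 1: the stream already exists, the client joins its group.
        return "join_existing_group"
    if any(q > requested_quality for q in relayed):
        # Situation 2: filter a higher-quality stream already present here.
        return "filter_and_create_group"
    # Situation 3 (and, by assumption, nodes relaying only lower qualities):
    # subscribe to a group first, then filter and root a new group.
    return "subscribe_filter_and_create_group"
```

For example, a candidate that already relays quality 4 can serve a quality-2 request by filtering locally, whereas an idle active node falls into the third situation.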

The election of the filtering nodes is based on two criteria:

1. The distance, i.e., the number of hops, between the client and the candidate
node. The first candidate chosen is the node closest to the client that
already belongs to the distribution tree, i.e., that relays or filters a flow
to satisfy requests of previous rounds. The next candidates are chosen close
to this one.

2. A function f that considers other factors such as total bandwidth used,
link utilization, and/or the use of node resources. This function can be
thought of as a measure of how good the complete distribution tree being
formed is. A lower value of f means a better distribution tree.

For simplicity, we assume only one variable that comprises the node resources, 
and that a filtering operation reduces the value of this variable by a predefined 
amount. If one active node has already exhausted its resources, filtering cannot 
be performed, and it is not considered as a candidate sender. 

As described above, our algorithm belongs to the category of exhaustive
search algorithms, so the number of possible states in each round directly
affects its efficiency. In the worst case, the number of candidate senders is
equal to the number of active nodes in the network, say A, plus the original
server. In such a case, the number of states N_c in round c becomes
(A + 1)^{c-1}. Since this is computationally expensive unless the number of
requests or active nodes in the network is small, two parameters were defined
to restrict the number of states N_c to analyze:

— We limit the number of candidate senders to expand in each round to a 
fraction b of the total candidate senders. 

— We restrict the number of new states generated in a round to a maximum 
of m. 

In each round, we select up to a maximum of m states to expand; the states
chosen are the ones with the lowest values of f. Each state is expanded into
b × (A + 1) new states, each of which corresponds to a different candidate
sender elected to satisfy the request of the next client. These new states are
chosen by the hop-distance criterion explained above. We continue expanding the
state tree until all the clients' requests are satisfied. Then the state with
the lowest f is chosen.
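
The pruned expansion described above resembles a beam search, which can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function and parameter names are hypothetical, a state is represented as a tuple of (client, sender) pairs, and the toy cost simply sums hop counts instead of using the function f of the paper.

```python
import math

def build_tree(requests, candidates_for, extend, cost, m, b):
    """Expand the state tree request by request, keeping few states per round.

    requests:       client requests, already sorted in processing order
    candidates_for: (state, request) -> candidate senders, closest first
    extend:         (state, request, sender) -> new state
    cost:           state -> value of the evaluation function (lower is better)
    m:              maximum number of states expanded in each round
    b:              fraction of the candidate senders tried for each state
    """
    states = [()]  # a single empty state before any client is considered
    for request in requests:
        best = sorted(states, key=cost)[:m]
        next_states = []
        for state in best:
            candidates = candidates_for(state, request)
            limit = max(1, math.ceil(b * len(candidates)))
            for sender in candidates[:limit]:
                next_states.append(extend(state, request, sender))
        states = next_states
    return min(states, key=cost)

# Toy data: hop distance from each client to the server "S" and active node "A".
HOPS = {("c1", "S"): 2, ("c1", "A"): 1, ("c2", "S"): 3, ("c2", "A"): 1}

def toy_candidates(state, client):
    return sorted(["S", "A"], key=lambda sender: HOPS[(client, sender)])

def toy_extend(state, client, sender):
    return state + ((client, sender),)

def toy_cost(state):
    return sum(HOPS[pair] for pair in state)

best_state = build_tree(["c1", "c2"], toy_candidates, toy_extend, toy_cost, m=20, b=1.0)
```

Smaller values of m and b shrink the number of states examined per round, which trades optimality of the resulting tree for computation time.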

2.3 Example 

Figs. 3 and 4 show an example network topology with 10 nodes. Active nodes 
are marked with squares and non-active ones with circles. Client requests are 
indicated with unfilled circles with a number that represents the requested qual- 
ity. The server is attached to node 3. When the sender is attached to an active 
node, we must distinguish if the filtering is performed at the active node, or if 
the stream is provided by the sender. 

The qualities are related to bandwidth according to the data in Table 1,
taken from previous work of our research group [4] for the MPEG-2 video
coding algorithm [6]. In the layered video case, the layers must be piled up to
achieve higher quality video. For example, the bandwidth required for a stream
of quality 4 is 5.19 (layer 1) + 3.56 (layer 2) + 4.89 (layer 3) + 9.01
(layer 4) = 22.65 Mb/s. The different qualities are obtained by varying the
quantizer scale, and active nodes derive the lower-quality video stream by
de-quantizing and re-quantizing the received stream.
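
The layered-video arithmetic can be checked with a few lines of code. The per-layer rates are the ones quoted in the example above; the names `LAYER_RATES` and `layered_bandwidth` are illustrative only.

```python
from itertools import accumulate

# Bandwidth of each layer in Mb/s, for layers 1 to 4, as quoted in the text.
LAYER_RATES = [5.19, 3.56, 4.89, 9.01]

# A stream of quality q needs layers 1..q, so its bandwidth is a running sum.
CUMULATIVE = [round(total, 2) for total in accumulate(LAYER_RATES)]

def layered_bandwidth(quality):
    """Bandwidth (Mb/s) needed to receive layered video at the given quality."""
    return CUMULATIVE[quality - 1]
```

The running sums are 5.19, 8.75, 13.64 and 22.65 Mb/s, which are the values listed for layered video in Table 1.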

Fig. 3 shows the multicast groups formed by our algorithm. Arrows show
the required streams, and arrow tips point to multicast group members. Two
filtering processes are needed in node 4 and one in node 9. Note that active
node 4 becomes a member of multicast group 1 only to provide filtered
streams to the clients in nodes 1 and 6.

As a comparison, Fig. 4 depicts the multicast groups needed in simulcast
transmission. The links between nodes 3-7 and between nodes 7-4 have to carry
several streams of different quality, and for this reason they can become
congested. Using the values of Table 1, simulcast needs 26.4 Mb/s on
these links, compared with the 14.4 Mb/s needed by our algorithm.
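
The difference between the two link loads can be reproduced with a short calculation. The single-layer rates are taken from Table 1; the set of qualities crossing the link, {4, 2, 1}, is an assumption chosen because it reproduces the 26.4 Mb/s figure, since the figure itself is not available here. The function names are illustrative.

```python
# Single-layer bandwidth per quality level in Mb/s, from Table 1.
SINGLE_LAYER = {4: 14.4, 3: 8.8, 2: 6.6, 1: 5.4}

def simulcast_link_load(qualities):
    """Simulcast: the link carries one full stream per distinct quality."""
    return round(sum(SINGLE_LAYER[q] for q in set(qualities)), 2)

def filtered_link_load(qualities):
    """With filtering downstream of the link, only the best quality crosses it."""
    return SINGLE_LAYER[max(qualities)]

# Assumed set of qualities requested behind the link (hypothetical).
downstream = {4, 2, 1}
simulcast_load = simulcast_link_load(downstream)
filtered_load = filtered_link_load(downstream)
```

With these inputs the simulcast load is 14.4 + 6.6 + 5.4 = 26.4 Mb/s, while a link feeding a filtering node only needs the 14.4 Mb/s of the highest-quality stream.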

Fig. 3. Multicast groups for our algorithm

Fig. 4. Multicast groups for simulcast

Table 1. Required bandwidth for streaming video (Mb/s)

quality (quantizer scale)   single-layer video   layered video
4 (10)                      14.4                 22.65
3 (20)                      8.8                  13.64
2 (30)                      6.6                  8.75
1 (40)                      5.4                  5.19

2.4 Applicability to Multiple QoS Dimensions

For simplicity, we assumed above the existence of only one QoS dimension. If we
lift this restriction, e.g., if clients request a stream with a defined
quantization scale and frame rate, the proposed algorithm is still applicable,
with the difference that the stream delivered to each multicast group is
characterized by two parameters instead of one. This implies that an active
node must be provided with a stream both of whose quality parameters are
greater than those of the stream that it must produce. Once the QoS parameters
are specified, we can deduce the required bandwidth using an approach proposed
in previous work of our research group [2].

3 Evaluation 

In this section, we show the effectiveness of our proposed algorithm through 
some numerical experiments. We generate random topologies using Waxman’s 
algorithm [15], and choose the parameters appropriately to generate topologies 
with an average degree of 3.5, to try to imitate the characteristics of real net- 
works [17]. We assumed the proportion of active nodes in the network to be 0.5. 
For simplicity, each filtering operation is assumed to use the same amount of 
resources. We also assumed that the number of filtering operations that each 
active node can do is a random value between 15 and 30. The location of active 
nodes is chosen at random. The locations of the server and the clients, and the
qualities of the corresponding requests, are also generated randomly and vary
from one experiment to another. Clients can request the video stream in one of
four available video qualities, according to Table 1. For comparison, we apply
two other approaches for multicast tree construction to the same topologies:
simulcast and distribution of layered coded video.

Fig. 5. Total bandwidth, 20-node network (x-axis: Experiment)

The definition of f, which is used in the algorithm to evaluate the quality of
the tree being built, can be modified according to which network parameters
are most important in the construction of the distribution tree. We performed
the evaluation using two simple definitions: minimizing bandwidth and
minimizing link utilization.

3.1 Minimizing Bandwidth 

In this case the definition of f is the total bandwidth used by the video
distribution tree. We call this definition f_1:

$$f_1 = \sum_{i \in U} B_i \qquad (1)$$

where i denotes a link, U is the set of used links, and B_i denotes the
bandwidth devoted to video distribution in link i.
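
As a small illustration of this definition, the per-link bandwidth B_i can be accumulated from the streams that cross each link and then summed. The data layout (a list of `(links, rate)` pairs) and the function names are assumptions made for this sketch.

```python
from collections import defaultdict

def link_loads(streams):
    """Accumulate the bandwidth B_i used on every link i.

    streams: list of (links, rate) pairs, where links is the list of links a
    stream traverses and rate is its bandwidth in Mb/s.
    """
    loads = defaultdict(float)
    for links, rate in streams:
        for link in links:
            loads[link] += rate
    return dict(loads)

def total_bandwidth(loads):
    """Total used bandwidth: the sum of B_i over the set U of used links."""
    return sum(loads.values())

# Hypothetical tree: a 14.4 Mb/s stream over two links, then a filtered
# 6.6 Mb/s stream over one further link.
example_loads = link_loads([
    ([("3", "7"), ("7", "4")], 14.4),
    ([("4", "1")], 6.6),
])
example_total = total_bandwidth(example_loads)
```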

To evaluate our algorithm with f_1, we generate ten 20-node and ten 50-node
network topologies, varying the number of requests among 10, 20 and 50. The
results are summarized in Figs. 5 and 6. In both figures, the first 10 experiments
are for 10 requests, the next 10 for 20 requests and the last 10 for 50 requests.

In general, the proposed algorithm requires a lower total bandwidth than
simulcast and layered video, at the cost of requiring processing at the filtering
nodes. When the number of requests is small, the total bandwidth used by
simulcast transmission is even smaller than that required for layered
transmission, because the overhead of the latter is not justified when clients
are dispersed. As the number of requests increases, the bandwidth required for
layered transmission becomes less than that required by simulcast, and
approaches that required by our proposed algorithm. Since we fixed the
proportion of active nodes at 0.5 in the generated topologies, increasing the
number of requests also increases the number of non-active nodes that have
clients attached requesting different quality streams. In those cases, several
streams must be relayed toward the non-active nodes because filtering cannot be
done locally, thus increasing the value of f_1.



3.2 Simultaneous Multicast Sessions 

Minimizing the total bandwidth required for video multicasting is intended
to avoid placing an extremely high load on the network and to let other
sessions set up their trees. In this subsection, we evaluate our algorithm while
increasing the number of sessions in the network, to see how many sessions can
be set up simultaneously and provided to users.

In the experiments, all the links are assumed to have a bandwidth of 100 
Mb/s. We multiplex sessions, each of which is set up according to our algorithm, 
until the bandwidth of any link is exhausted. Here, we should note that the 
network we consider is best-effort and the constraint on the available link band- 
width is not taken into account in our algorithm stated in Section 2. Thus, the 
number we consider here is that of simultaneously acceptable sessions without 
causing a seriously overloaded link. The sessions are independent, and we do not 
use the information of the links used by the other sessions to build the current 
tree. 
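
The multiplexing procedure can be sketched as a simple admission loop. This is an illustration under assumptions: sessions are represented by hypothetical per-link load dictionaries, and a link is treated as exhausted once the accumulated load would exceed its capacity.

```python
def count_admissible_sessions(session_loads, capacity=100.0):
    """Count how many sessions fit, in order, before some link is exhausted.

    session_loads: list of dicts {link: bandwidth in Mb/s}, one per session,
    each computed independently of the other sessions.
    capacity: bandwidth of every link in Mb/s.
    """
    used = {}
    accepted = 0
    for loads in session_loads:
        trial = dict(used)
        for link, bandwidth in loads.items():
            trial[link] = trial.get(link, 0.0) + bandwidth
        if any(total > capacity for total in trial.values()):
            break  # this session would overload a link, stop multiplexing
        used = trial
        accepted += 1
    return accepted

# Hypothetical case: every session places a 22.65 Mb/s stream on link "a".
same_link_sessions = [{"a": 22.65}] * 10
admitted = count_admissible_sessions(same_link_sessions)
```

Because each tree is built without knowledge of the others, sessions that happen to share a link exhaust it quickly; in the toy case above only four sessions fit on a 100 Mb/s link.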

In addition to f_1, which minimizes the total bandwidth, we introduce
another function f_2, which is related to the required average bandwidth of the
used links:

$$f_2 = \frac{\sum_{i \in U} B_i}{|U|} \qquad (2)$$

where i denotes a used link, U is the set of used links, and B_i denotes the used
bandwidth in link i. We expected that with this definition, our algorithm could
perform some sort of “load balancing,” to avoid congesting a single link.
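
The trade-off between the two evaluation functions can be seen on a toy example. The sketch assumes that the second function is the arithmetic mean of the bandwidth over the used links, which is how the text describes it; the link names and loads are hypothetical.

```python
def total_bandwidth(loads):
    """First definition: sum of the bandwidth used on every used link."""
    return sum(loads.values())

def average_bandwidth(loads):
    """Second definition (assumed form): mean bandwidth over the used links."""
    if not loads:
        return 0.0
    return sum(loads.values()) / len(loads)

# Tree A filters late: two links, both carrying the 14.4 Mb/s stream.
tree_a = {"l1": 14.4, "l2": 14.4}
# Tree B filters early but detours: more links, most carrying only 6.6 Mb/s.
tree_b = {"l1": 14.4, "l2": 6.6, "l3": 6.6, "l4": 6.6}
```

Tree B has the lower average (8.55 against 14.4 Mb/s) but the higher total (34.2 against 28.8 Mb/s), so minimizing the average can favour trees that spread over more links.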

In Figs. 7 and 8, experiments 1-10, 11-20, 21-30, 31-40, 41-50, 51-60 refer
to the 20-node 10-request, 20-node 20-request, 20-node 50-request, 50-node
10-request, 50-node 20-request, and 50-node 50-request cases, respectively.

Fig. 7 shows the average bandwidth required to establish the first ten sessions
at the same time. f_1 shows the lowest value in all cases. Even though we
chose f_2 to minimize the average bandwidth of used links in each session, when
we sum all the sessions, f_2 results in the highest values. Between them lie the
values for simulcast and layered video. When the number of requests is small (10
requests), the average bandwidth used by layered distribution is greater,
but for larger numbers of requests it is surpassed by that of simulcast.

As a result of spreading the sessions over a large number of links, the variance
of the bandwidth for f_2 is also reduced. The values for f_1 are even smaller
because the average bandwidth values are smaller than in the other cases. We do
not show the numerical values due to space limitations.

Fig. 7. Average bandwidth (Mb/s) for the first ten sessions

Fig. 8 shows the maximum number of simultaneous sessions that could be set
up using each of the three methods, i.e., the proposed algorithm, simulcast
and layered distribution. The results show performance in the following order,
from better to worse: the proposed algorithm using f_1, the proposed algorithm
using f_2, layered transmission, and simulcast. There were a few cases in which
our proposed algorithm was surpassed by the layered video approach. We expect
this to occur when the same stream is carried with different qualities over the
same link, congesting it as happens in simulcast. This occurs, for example, when
several clients connected to a non-active node request different quality
streams.

Even when the senders are concentrated in one region of the network,
the advantage of f_2 is relatively small; these results are not shown in this
paper due to space limitations.

3.3 Required Computation Time 

In this paper, we have not analyzed the effect of varying the values of b and m.
Their choice involves a trade-off between required processing time and
optimality of the obtained solution. To give a rough idea, we averaged the time
required by our algorithm to generate some multicast trees. For networks with
20 nodes, we required an average of 6, 14 and 56 seconds for trees with 10, 20
and 50 requests, respectively, using m = 20 and b = 1. For networks with 50
nodes, we required 101, 238, and 451 seconds for 10, 20 and 50 requests,
respectively, using m = 20 and b = 0.3 for the first two cases and b = 0.2 for
the last case. We evaluated our algorithm written in Java on an 800 MHz
Pentium III machine.

Fig. 8. Maximum number of simultaneous sessions (out of 15)

3.4 Other Definitions for f

We considered above two simple definitions for f: the first one, f_1, is simply
the sum of the bandwidth used in each link of the distribution tree; the second
one, f_2, is an expression for the average bandwidth of used links. With f_2 we
expected to increase the number of possible simultaneous sessions by reducing
the bandwidth used per link, at the expense of increasing the number of used
links. The problem with f_2 is that it greedily increases the number of used
links in the tree, sometimes misplacing the filtering location. As mentioned
earlier in this section, some links can become congested when they carry the
same video with different qualities. We could augment the definition of f_1
with a function that tries to avoid links that carry different quality flows
simultaneously, adding a penalty value if such links exist.
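
One possible shape for such an augmented function is sketched below. The paper does not define the penalty, so everything here is hypothetical: the function name, the data layout, and the choice of charging a fixed penalty for every additional quality carried on the same link.

```python
def total_bandwidth_with_penalty(link_streams, penalty):
    """Total bandwidth plus a penalty for links carrying several qualities.

    link_streams: {link: {quality: bandwidth in Mb/s}} for every used link.
    penalty: value added for each quality beyond the first on a link
             (a hypothetical tuning parameter).
    """
    total = 0.0
    for streams in link_streams.values():
        total += sum(streams.values())
        extra_qualities = len(streams) - 1
        if extra_qualities > 0:
            total += penalty * extra_qualities
    return total

# Hypothetical tree: link 3-7 carries the same video at qualities 4 and 2.
example_tree = {
    ("3", "7"): {4: 14.4, 2: 6.6},
    ("7", "4"): {4: 14.4},
}
penalized = total_bandwidth_with_penalty(example_tree, penalty=10.0)
plain = total_bandwidth_with_penalty(example_tree, penalty=0.0)
```

With a zero penalty the function reduces to the plain sum of link bandwidths; a larger penalty pushes the search toward trees in which no link carries duplicate versions of the video.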

We can also consider limiting the number of filtering operations performed in
the distribution tree or at a single node, if we need to ensure limited use of
node processing resources. Although we assume a best-effort network, if link
capacity is constrained, it is also possible to modify f to take this
restriction into account.

4 Summary 

We presented an algorithm for electing the filtering nodes in an active network
to construct a heterogeneous multicast distribution tree. The algorithm aims to
minimize a function f that can be defined to take selected network parameters
into account, in order to achieve efficient use of the network resources.

We evaluated our algorithm with two simple definitions for f: the total
bandwidth used, i.e., the sum of the bandwidth used in each link, and the average
bandwidth of used links. We compared our algorithm with two other methods
of distributing video that do not use active nodes, simulcast and layered
encoded distribution, and found that with our algorithm we can set up
a greater number of simultaneous sessions, meaning a more effective use of the
available bandwidth of the network, but at the expense of requiring processing
capability at the network nodes.

In our evaluation, we chose very simple forms for f. We have not tested
further, but we think that better distribution trees can be achieved with more
elaborate definitions of f.

We also presented a simple outline of how an active application can make
use of our approach. It can result in a simple implementation, since the tree
is constructed in a centralized manner. For the same reason, however, we cannot
expect the algorithm to scale to large internetworks.

We did not consider changing the distribution tree dynamically in response to
changes in the network conditions, which is necessary in best-effort networks
such as the one we considered. This is left as a future research topic.
Abstract. Per-hop behaviours capable of supporting different traffic
classes are essential for the provision of quality of service (QoS) on the 
Internet according to the Differentiated Services model. This paper 
presents an approach for supporting different traffic classes in IP 
networks, proposing a new per-hop behaviour called D3 - Dynamic 
Degradation Distribution. The approach allows for the dynamic 
distribution of network resources among classes, based on the measured 
quality of service and on the sensitivity of classes to performance 
degradation, without implying any substantial change to current IP 
technologies. A router prototype developed according to the proposed 
approach is presented, along with the results of experimental tests that 
were performed. The test results demonstrate the feasibility and the 
effectiveness of the underlying ideas. 



1 Introduction and Framework 

Differentiated Services (Diffserv) is an architectural framework currently being 
studied and proposed for the Internet, by which different levels of network quality-of- 
service (QoS) can be provided to different traffic streams [5]. The basic principle of 
Diffserv networks is that routers will handle packets of different traffic streams by 
applying different per-hop behaviours (PHBs) to packet forwarding. 

The Internet has been growing at an extremely fast pace and not always in a 
predictable way. The large number of every-day new users induces more and more 
load on the network. It also promotes the emergence of new services which, again, 
result in more traffic being generated. In this context, it is not surprising that one often 
finds very poor quality levels when using the network. 

One possible approach to QoS provision on the Internet is to extend the current 
single-class-best-effort Internet paradigm to a multiple-class-best-effort paradigm. 
This is the approach that is explored in the work presented in the current paper. 

The proposed concept of multiple-class-best-effort provides a way to deal with
traffic classes considering their relative QoS needs while still treating them as
well as possible. This can be done by protecting more important classes when the
load grows and letting less important classes absorb the major part of the
performance degradation. Such a change, simple and totally compatible with the
principles of IP technology, can have a strong impact on the behaviour of IP
networks, contributing in a positive way to the global satisfaction of IP
service consumers and providers.

The multiple-class-best-effort proposal requires an integrated approach to 
scheduling and queue management. Although significant work exists in each of these 
fields (for instance, WFQ, W2FQ or Stochastic Fair Queuing [4], [19], in the area of 
packet scheduling, and RED and BLUE [12], [11] for queue management) these 
approaches do not work in an integrated way, in addition to having several known 
limitations and problems [18], [6], [27]. 

At the Laboratory of Communications and Telematics of UC (LCT-UC) the
authors are working on a project whose main goal is to study an alternative approach
to support traffic classes in IP networks following the Diffserv framework, having in
mind the multiple-class-best-effort paradigm. This effort can be divided into three
main parts. The first one is to develop new mechanisms in Network Elements (NE),
such as routers and switches, in order to support the multiple-class-best-effort model
described above. The second is to conceive a way to select adequate paths for the
forwarding of packets of different classes along the communication system. The third 
part is to implement an effective way to manage the system, which may include some 
form of traffic admission control at network edges as a way of controlling the 
communication system load. Naturally, the idea is to do all this in an integrated and 
coherent way. 

This paper addresses the NE part of this work. It presents the foundations for
a proposal of a new PHB named D3 (which stands for Dynamic Degradation 
Distribution). Specifically, section 2 presents the main characteristics and basic ideas 
that support D3, focusing the discussion on a router prototype developed to evaluate 
the feasibility and effectiveness of such a PHB. Section 3 presents the tests carried out 
on the prototype and analyses their results. Section 4 summarises the contributions 
and presents guidelines for further work. 



2 A QoS-Capable Router Prototype

The purpose of D3 is to continuously re-distribute router resources in such a way that 
some classes will be protected from performance degradation at the expense of others. 
Classes have different sensitivity to QoS degradation - a given relative degradation on 
transit delay, for instance, can be unacceptable for some applications but well 
tolerated by others. D3 guarantees that the increase in transit delay and loss suffered 
by packets when the load grows reflects the sensitivity of classes to delay and loss 
degradation. In short, more sensitive classes will suffer less degradation with load 
increase, at the expense of a greater degradation suffered by less sensitive classes. The 
degradation is dynamically distributed among classes, and the control of this 
distribution is made independently for transit delay and loss. 

For the purpose of quantifying the quality of service provided to classes, a QoS
metric presented in [28] was used. This metric is described in the following
subsection.

2.1 QoS Metric for Packet Networks

The IP Performance Metrics Working Group (IPPM) of IETF [IPPM] is conducting a 
major effort for the definition of a set of standard metrics that can be applied to 
evaluate data delivery services. IPPM considers QoS characteristics (such as delay or 
loss) independently [24], [1], [2], [16]. In spite of IPPM and other work, there is not, as
far as the authors know, a comprehensive way to measure QoS considering different 
characteristics altogether. The metric developed at LCT-UC intends to address this 
challenge: to provide a broad and coherent view of the QoS given to applications, 
despite the heterogeneous nature of the characteristics being measured (for example 
delay or loss). 

According to this metric, quality of service is quantified through a variable named
congestion index (CI). For each class, there is a CI related to transit delay and a
CI related to loss.

The concept of degradation slope (DSlope) is used by the metric to define the
sensitivity of classes to delay and loss degradation. A traffic class highly sensitive to
QoS degradation for a given QoS characteristic has a high DSlope associated with
that characteristic. Figure 1 refers to three classes with different sensitivities to delay
degradation (the same would apply to loss). Classes with lower
DSlope (measured in degrees) are less sensitive to degradation, so their CI
grows more slowly¹.
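
To make the role of DSlope concrete, the sketch below assumes the simplest linear relation, in which the congestion index grows with the measured delay along a line inclined at DSlope degrees. The exact metric is defined in [28] and is not reproduced here, so the formula, the function name and the units are assumptions.

```python
import math

def congestion_index(delay, dslope_degrees):
    """Assumed linear model: CI = tan(DSlope) * delay.

    delay is expressed in arbitrary normalized units; dslope_degrees is the
    degradation slope of the class, in degrees (0 < DSlope < 90).
    """
    return math.tan(math.radians(dslope_degrees)) * delay

# The same delay yields a higher congestion index for a more sensitive class.
ci_sensitive = congestion_index(2.0, 60.0)
ci_tolerant = congestion_index(2.0, 30.0)
```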

Using this metric one can say that resources are being shared fairly when
the different classes have the same CI value related to delay and the same CI value
related to loss. In that case, the impact of the degradation on applications is
practically the same for all of them, despite the different absolute values of
transit delay and loss experienced by their packets. Thus, when a router is dealing
with n classes, formula (1) should hold in a given time interval [t_i, t_{i+1}].

$$
CI_{Delay}^{Class_1}[t_i, t_{i+1}] = CI_{Delay}^{Class_2}[t_i, t_{i+1}] = \dots = CI_{Delay}^{Class_n}[t_i, t_{i+1}] = CI_{avg}[t_i, t_{i+1}]
$$

and

$$
CI_{Loss}^{Class_1}[t_i, t_{i+1}] = CI_{Loss}^{Class_2}[t_i, t_{i+1}] = \dots = CI_{Loss}^{Class_n}[t_i, t_{i+1}] = CI_{avg}[t_i, t_{i+1}] \qquad (1)
$$

The time interval [t_i, t_{i+1}] should be relatively small; as a reference, one can
take the time needed to send a few tens of output-link MTU-sized packets².

¹ For simplicity, a linear variation has been considered. It is possible to use different
variation types, for instance a logarithmic one, which is perhaps better suited to
represent the sensitivity of human beings to QoS degradation [20].

² Experiments with a time interval corresponding to 8 MTU-sized packets were carried
out and the observed overhead was low.

Figure 1 (left part) represents the status of an NE at a given instant in time (t1)
with respect to the provision of QoS to classes, considering only transit delay. The
CIs are equal for all the classes (CI_t1), corresponding to an average transit delay of
d1, d2 and d3 for packets of class 1, class 2 and class 3, respectively. At a subsequent
instant in time (t2) the NE is experiencing a higher load. In this case, the NE degrades
the performance given to all classes. The CIs are still equal for all classes but now
they have a higher value (CI_t2), as shown in Figure 1 (right part).

The degradation of the absolute values of transit delay between t1 and t2 is (d'1 - d1),
(d'2 - d2), and (d'3 - d3) for class 1, class 2, and class 3, respectively. It is clear that
(d'3 - d3) is much larger than (d'2 - d2) which is, in turn, larger than (d'1 - d1). So
class 1, the one that is most sensitive to delay degradation, was protected when the
load grew, at the expense of class 3, which absorbed the major part of the degradation.
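
The example can be reproduced numerically under the same linear assumption, CI = tan(DSlope) × delay, which is an illustrative model and not the exact metric of [28]. The slope values and class names below are hypothetical.

```python
import math

# Hypothetical degradation slopes in degrees; class1 is the most sensitive.
DSLOPES = {"class1": 60.0, "class2": 45.0, "class3": 30.0}

def delay_for_ci(ci, dslope_degrees):
    """Delay at which a class reaches the given CI under the linear model."""
    return ci / math.tan(math.radians(dslope_degrees))

def delay_increase(ci_before, ci_after):
    """Per-class growth in delay when the common CI rises for all classes."""
    return {
        name: delay_for_ci(ci_after, slope) - delay_for_ci(ci_before, slope)
        for name, slope in DSLOPES.items()
    }

# The common congestion index doubles as the load grows from t1 to t2.
increase = delay_increase(1.0, 2.0)
```

Keeping the congestion indexes equal forces the delay of the least sensitive class to grow about three times as much as that of the most sensitive one in this toy setting, which is the protection effect described above.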




Fig. 1. Congestion indexes for three traffic classes with different sensitivities to delay 
degradation 



2.2 Prototype Main Components and Architecture 

In order to evaluate the proposed ideas and to prove their feasibility, a QoS-capable
router prototype supporting the D3 per-hop behaviour was implemented. Figure 2
presents its basic components: classifier, packet monitor, packet scheduler and packet
dropper. The prototype allocates one independent queue to each class. The
classifier/marker is responsible for determining each packet's class and/or setting the
information related to it in the packet header, following the strategy for IP packet
classification/marking defined by the DSWG [21].

The monitor is responsible for determining the average packet transit delay and
loss for each class, and for calculating the corresponding congestion indexes. By
default, it performs the calculations every 10 ms, but it may be
configured to use different values. The frequency of CI calculation should be defined
as a trade-off between processing overhead and system responsiveness.
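The periodic measurement performed by the monitor can be illustrated with a minimal Python sketch. It is not the kernel code of the prototype: the class name ClassMonitor and its methods are hypothetical, and the congestion index itself is left out because its definition is given by the QoS metric presented earlier in the paper. The sketch only shows how per-class average transit delay and loss ratio could be accumulated and reported once per measurement interval (10 ms by default in the prototype).

```python
from collections import defaultdict


class ClassMonitor:
    """Hypothetical monitor: accumulates samples and reports once per interval."""

    def __init__(self):
        self._delays_ms = defaultdict(list)  # class id -> transit delays seen
        self._dropped = defaultdict(int)     # class id -> packets dropped

    def packet_forwarded(self, class_id, transit_delay_ms):
        self._delays_ms[class_id].append(transit_delay_ms)

    def packet_dropped(self, class_id):
        self._dropped[class_id] += 1

    def close_interval(self):
        """Return {class_id: (avg_delay_ms, loss_ratio)} and reset the counters."""
        report = {}
        for class_id in set(self._delays_ms) | set(self._dropped):
            delays = self._delays_ms.get(class_id, [])
            dropped = self._dropped.get(class_id, 0)
            total = len(delays) + dropped  # always >= 1 for a recorded class
            avg_delay = sum(delays) / len(delays) if delays else 0.0
            report[class_id] = (avg_delay, dropped / total)
        self._delays_ms.clear()
        self._dropped.clear()
        return report
```

A caller would invoke close_interval() from a timer and feed the returned averages into the congestion index calculation; a shorter interval gives faster reaction at the price of more processing.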



DSWG: Differentiated Services Working Group of the IETF [DSERV]

The core resources of a router are its transmitter capacity and its memory. The role
of the packet scheduler and the packet dropper is to dynamically distribute those
resources among traffic classes. The dynamic nature of the proposed
distribution contrasts with the much more common static allocation of resources to
classes. According to the delay Congestion Indexes, the scheduler slows down the
forwarding of packets of some classes and speeds up others. According to the loss
Congestion Indexes, the dropper provides more memory to some queues by removing it
from others. The criterion that rules the dynamic distribution of resources is, as
mentioned before, the equalisation of CIs.



[Figure 2 diagram: IP level, input queue, IP packets, output queues (ALTQ technology), two Ethernet cards]

Fig. 2. Router prototype architecture

The basic operating principle of the scheduler is the one formally expressed in
formula (1). This principle determined the scheduler design, which was nevertheless
influenced by lessons learned from the study and testing of other disciplines [27],
for instance the weighted fair queueing discipline [15], [8]. Early on, it was realised
that a pure work-conserving packet scheduling discipline could not be used on
the platforms available at LCT-UC. In fact, the dynamics of the systems in use tend
to serialise the appearance of packets in queues [27], [29], which prevents
such disciplines from differentiating traffic effectively. We argue, in agreement with
other research work [17], that a scheduler should combine the best of the
work-conserving and non-work-conserving worlds, namely i) simplicity and a good
level of resource utilisation and ii) the capacity to keep some part of the resources
available for high-priority traffic. The basic algorithm that supports the proposed
scheduler is presented in Appendix A and detailed in [25].

The dropper developed at LCT-UC [26] also incorporates some important lessons
learnt from experiments and from the work of other researchers [6], [18], [7], [16]. It
avoids dropping TCP packets in order to protect TCP traffic from UDP traffic: TCP
has its own congestion avoidance mechanisms, triggered by packet drops, whereas
UDP has no such mechanisms. In addition, UDP and TCP packets are randomly
dropped. Most importantly, the dropper removes memory from some queues and
allocates it to others, according to the load and the needs of each QoS class. The
criterion that rules the dynamic reallocation of memory is the equalisation of the classes' drop
Congestion Indexes. The basic algorithm that supports the proposed dropper is
presented in Appendix B.

The FreeBSD operating system [10], patched with ALTQ technology [8], was used
to implement the router prototype. Its protocol stack was modified in order to include
the components referred to above. The scheduler and dropper were developed
separately and submitted to specific tests [25], [26]. They were integrated in the
prototype only at a subsequent stage of the project. In the next section the first test
suite carried out on the router prototype is presented.



3 Prototype Tests 

The main goal of the tests presented in this section is to prove the feasibility and
effectiveness of the concepts proposed in this paper.

The testbed used to carry out the tests consisted of a small isolated network with 5
Intel Pentium PC machines configured with a Celeron 333 MHz CPU, 32 MB RAM
and Intel EtherExpress Pro100B network cards. The prototype router ran FreeBSD
2.2.6, patched with ALTQ version 1.0.1, with 64 MB RAM. Three hosts generated
traffic directed towards a destination host through the prototype router. Each host only
generated traffic of a given class, in order to guarantee the independence of the
generated packet streams.

The tests were performed with the aid of two basic tools: 

- Nttcp [22]: generates UDP packets with a given length and at a given rate.

- QoStat [3]: monitors the kernel, namely the operation of the installed prototype.
This tool was developed at LCT-UC. With it, it is possible to obtain data related to
QoS provision in real time and also to change the most important operational
parameters of the prototype.



3.1 Tests - Fixing the Classes' Sensitivity to Loss Degradation 

In the first set of tests, the classes' sensitivity to loss degradation was set to a fixed
value, corresponding to a degradation slope of 45 degrees for all classes. Conversely,
class 1 and class 3 sensitivity to delay degradation varied with time: class 1 became
less sensitive, class 3 became more sensitive, and class 2 maintained its sensitivity.
Table 1 presents the progression of each class's degradation slope with time.

Table 1. Evolution of classes' DSlopes with time during the test

Time    Class 1           Class 2           Class 3
(s)     Delay    Drop     Delay    Drop     Delay    Drop
        DSlope   DSlope   DSlope   DSlope   DSlope   DSlope
  0     45       45       45       45       45       45
 25     40       45       45       45       50       45
 50     35       45       45       45       55       45
 75     30       45       45       45       60       45
100     25       45       45       45       65       45



The QoStat tool was used to dynamically change each class's delay DSlope
after each 25-second interval; the changes were always applied simultaneously
to all classes. As the objective was to evaluate the global prototype behaviour
specifically under high load conditions, nttcp was used to generate fixed-length
packets of 1400 bytes, at about 60 Mbps (for each class). The results of this test are
presented in Figure 3.

Fig. 3. Variation of the following values with the classes' sensitivity to degradation: a) Average IP
transit delay (over 1-second intervals); b) Number of packets sent per second; c) Number of
dropped packets per second; d) Instantaneous IP output queue length (sampled every second)

When the sensitivity to delay degradation changes, the scheduler reacts by
redistributing the processing capacity among classes; the sharp transition shown in the
graph of Figure 3a) reveals good responsiveness of the system to the change. As
class 3 becomes more sensitive to delay degradation, its packets experience smaller
delays. The opposite happens with class 1 as it becomes less sensitive to degradation.

Conversely, because the classes' sensitivity to drop degradation has the same value,
the number of processed packets per unit of time is roughly the same for all of them
(Figure 3b). Nevertheless, as the difference between the best and the
worst delay DSlope increases, the system starts to diverge a little. This is due to the
lower effectiveness of the algorithm that maps the difference between the CIs measured
for each class to the scheduler's operational adjustment. The algorithms that convert
CI differences into scheduler and dropper operational adjustments are very simple and
constitute one of the subjects for further study.

We repeated the tests presented above, now changing the fixed values configured
for the drop degradation slope. Instead of configuring the value 45 for all classes, we
configured the values 30, 45 and 60 for class 1, class 2 and class 3, respectively.
Those values were kept fixed during the experiment. The average packet
transit delay and number of sent packets measured under those conditions are shown
in Figure 4. One can see that the router is still able to react coherently when the delay
degradation sensitivity is changed, but now differentiating classes in terms of their
average packet transit delay, according to the configured delay DSlope.




Fig. 4. Variation of the following values with the classes' sensitivity to degradation: a) Average IP
transit delay (over 1-second intervals); b) Number of packets sent per second



3.2 Tests - Fixing the Classes' Sensitivity to Delay Degradation 

In the following step, a complementary set of tests was made. This time the sensitivity
of the classes to delay was fixed (with a degradation slope of 45 degrees for all
classes), and their sensitivity to loss degradation was made to change during the
course of the tests. Table 2 presents the progression of each class's degradation slope
with time.



Table 2. Evolution of classes' DSlopes with time during the test

Fig. 5. Variation of the following values with the classes' sensitivity to degradation: a)
Average IP transit delay (over 1-second intervals); b) Number of packets sent per second; c)
Number of dropped packets per second; d) Instantaneous IP output queue length (sampled
every second)



The results of this test are presented in Figure 5. When the sensitivity to losses
changes, the drop effort is redistributed among classes. As class 3 becomes more
sensitive to loss degradation, fewer of its packets are dropped, and so more of its
packets are processed. The opposite happens with the class becoming less sensitive to
loss degradation. This is a consequence of the redistribution of the maximum queue
lengths among classes. When class 3 becomes more sensitive to loss it receives a
greater share of the global queue space (Figure 5d).



4 Main Contributions and Future Work 

This paper presented an approach to support traffic classes in IP networks, based on 
the Differentiated Services framework. The basic idea is to dynamically share the 
resources available in the network among existing classes, according to the measured 
quality of service and to the classes' sensitivity to QoS degradation. When applied to 
routers, the proposal leads to a new per-hop behaviour, moving from the current 
single-class-best-effort paradigm in use on the Internet to a multiple-class-best-effort 
paradigm. The approach is supported by a QoS metric developed at LCT-UC, which 
was also presented in this paper. 

One of the main advantages of the presented proposal is that it provides ways to
use communication resources efficiently, considering the classes' performance needs.
In addition, the approach does not imply any substantial changes to
current IP technologies, which is another important advantage, if not the most
important one. All the classes are treated as well as possible, considering that they
have different sensitivities to performance degradation.

To evaluate the presented ideas a router prototype was implemented. Its main 
components are a packet classifier, a monitor, a scheduler and a dropper. The tests 
that were carried out on the prototype proved the feasibility of the underlying ideas 
and showed that, despite the use of simple algorithms, the prototype revealed a stable 
and effective behaviour. 

As ongoing work, the authors are studying ways to extend the presented
ideas beyond the network element. More specifically, mechanisms for QoS-aware
dynamic routing of traffic classes and for admission control based on the values of
congestion indexes are being implemented.

6 Appendix A - Scheduler General Architecture

There are two core parameters related to the scheduler operation, which
characterize each queue: dequeue_time, which determines the instant of time
after which the next packet of the class can be processed; and x_delay, which is
used to update dequeue_time. x_delay determines the time interval (in µs) that
must elapse between the processing of consecutive packets of a given class.

The following points concisely present the scheduler operation:

1. the scheduler visits each queue in a round-robin fashion;

2. in each visit, it compares the current time with the queue's dequeue_time; if
the former is greater than the latter, a packet is processed;

3. for the active class that is most sensitive to delay degradation (which is named
class_1), x_delay is always zero, so the scheduler behaves as a work-conserving
scheduler for this class;

4. for the remaining classes (classes_r), x_delay will be greater than zero; the
scheduler behaves as a non-work-conserving scheduler.

The scheduler operation is logically presented in Figure A.1.

The x_delay values of the remaining classes (classes_r) are dynamically adjusted so
that their congestion indexes equal the congestion index of class_1. CIs are
calculated and x_delays are adjusted every n ms (n is configurable; experiments
were made for n equal to 10 and 100 ms). All the packets processed by the scheduler
during that time period are considered for the average transit delay calculation and for
the respective CI calculation (the short-term CI). Some tests using higher frequencies
for CI calculation were also carried out, showing that the strategy used does not imply
a significant overhead.

* Each class has its own exclusive queue.

* Active class: a class with packets recently processed by the scheduler. The frequency of
evaluation of active classes is configurable, and should be closely related to the
[t_i, t_(i+1)] time interval referred to in formula (1).
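The scheduler operation described in this appendix can be sketched in a few lines of Python. This is an illustrative model, not the algorithm of [25]: the names ClassQueue, schedule_round and adjust_x_delay are hypothetical, the rule dequeue_time = now + x_delay is an assumed reading of "x_delay is used to update dequeue_time", and the fixed-step adjustment of x_delay is a stand-in for the actual mapping from CI differences to scheduler adjustments. The comparison uses ">=" rather than "greater than" so that a class with x_delay equal to zero is served on every visit in this discrete-time model.

```python
from collections import deque


class ClassQueue:
    """One queue per traffic class (hypothetical helper for this sketch)."""

    def __init__(self, name, x_delay_us=0):
        self.name = name
        self.packets = deque()
        self.x_delay_us = x_delay_us  # minimum gap between consecutive packets
        self.dequeue_time_us = 0      # earliest instant the next packet may leave


def schedule_round(queues, now_us):
    """Visit every queue once, round robin, and return the packets sent."""
    sent = []
    for q in queues:
        # Assumption: ">=" so that x_delay == 0 never blocks a backlogged class.
        if q.packets and now_us >= q.dequeue_time_us:
            sent.append((q.name, q.packets.popleft()))
            # Assumed update rule: next departure no earlier than now + x_delay.
            q.dequeue_time_us = now_us + q.x_delay_us
    return sent


def adjust_x_delay(q, ci_q, ci_ref, step_us=10):
    """Hypothetical step controller nudging ci_q towards the reference CI."""
    if ci_q > ci_ref:      # class is worse off than the reference: serve it sooner
        q.x_delay_us = max(0, q.x_delay_us - step_us)
    elif ci_q < ci_ref:    # class is better off than the reference: slow it down
        q.x_delay_us += step_us
```

With x_delay set to zero the most delay-sensitive class is served whenever it has packets (work-conserving behaviour), while a class with a positive x_delay is held back even if the link is idle (non-work-conserving behaviour).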



7 Appendix B - Dropper General Architecture 

There are two important parameters related to the dropper operation, which
characterize each queue: q_physical_limit, the maximum possible length for
each queue (which is arbitrarily large); and q_virtual_limit, the desired
maximum length for each queue.

Each time a packet is to be enqueued and the physical limit has been reached, the packet is
immediately dropped. As this limit is extremely large, this will be uncommon. Thus,
every packet or burst of packets will normally be enqueued.

The dropper is activated periodically. The activation interval is adjustable and should
reflect the length of the packet bursts to be accommodated (the default value used
corresponds to 8 packets). When the dropper is activated, it verifies, for each queue,
whether there are more packets than the amount imposed by q_virtual_limit. If this is the
case, the dropper discards the excess packets so that the limit is respected.

For each queue, TCP packets will only be dropped if they exceed a given
percentage of the total number of packets. Both TCP and UDP packets are randomly
dropped. The dropper is logically presented in Figure B.1.

The virtual limits for each class are dynamically adjusted according to the loss CI
measured for them (the goal is to maintain the same value of CI for all classes). The
sum of the classes' q_virtual_limit values is a constant (MAX_PCKS_ONQ, which
corresponds to the maximum number of packets that can be stored in the queue
system at any given time). The queue with the worst CI receives some amount of
buffer space taken from the queue with the best CI. The buffer amount transferred
from one queue to the other depends on the difference between the CIs. The timings for loss CI
calculation follow the ones for delay CI calculation.
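The two dropper mechanisms described above can be sketched as follows. This is a simplified Python illustration, not the dropper of [26]: the function names, the gain factor that converts a CI difference into a number of buffer slots, the min_limit floor and the tcp_max_share threshold are all assumptions introduced for the example, and packets are represented only by their protocol name.

```python
import random


def rebalance_virtual_limits(limits, loss_ci, gain=1.0, min_limit=1):
    """Move buffer space from the queue with the best (lowest) loss CI to the
    queue with the worst (highest) one; the sum of the limits stays constant."""
    worst = max(loss_ci, key=loss_ci.get)
    best = min(loss_ci, key=loss_ci.get)
    new_limits = dict(limits)
    if worst == best:
        return new_limits
    # Assumed mapping: transferred amount proportional to the CI difference.
    amount = int(gain * (loss_ci[worst] - loss_ci[best]))
    amount = max(0, min(amount, limits[best] - min_limit))
    new_limits[best] -= amount
    new_limits[worst] += amount
    return new_limits


def enforce_virtual_limit(queue, virtual_limit, tcp_max_share=0.5, rng=None):
    """Randomly discard packets until the queue respects its virtual limit.
    TCP packets are eligible only while they exceed tcp_max_share of the queue
    (or when no UDP packet is left to drop)."""
    rng = rng or random.Random()
    kept = list(queue)
    dropped = []
    while len(kept) > virtual_limit:
        tcp = [i for i, proto in enumerate(kept) if proto == "tcp"]
        udp = [i for i, proto in enumerate(kept) if proto != "tcp"]
        tcp_over_share = len(tcp) > tcp_max_share * len(kept)
        candidates = udp + tcp if (tcp_over_share or not udp) else udp
        dropped.append(kept.pop(rng.choice(candidates)))
    return kept, dropped
```

Because space is only moved between two queues, the total stays equal to the configured constant (MAX_PCKS_ONQ in the text) without any extra bookkeeping.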
Fig. A.1. Scheduler logical presentation*

* This diagram reflects the two scheduler-development phases and, thus, it is not optimized.

Fig. B.1. Dropper logical presentation
Abstract. As soon as a network offers different service levels, pricing
is needed to give incentives for not always using the highest quality of 
service (QoS). An open issue is how to derive a price for a certain service 
based on the transfer costs for data within this service. This paper shows 
that there is a need for a framework for end-to-end pricing between 
providers and for inter-domain interchange of price information. The 
presented framework leaves the freedom of choosing a certain pricing 
model within domains in order to foster competition among different 
providers. It uses a hierarchical, domain centered approach which can be 
easily deployed within Differentiated Services networks because it may 
use already available components for pricing purposes. 



1 Motivation 

Nowadays it is quite common to deploy new networks which offer much more
bandwidth than currently needed. This is mainly done because it can be foreseen that
future applications and the growing number of Internet users will demand far
more bandwidth in the future. It would be nice to have a network capable of
satisfying all user needs, i.e., a network that has a bandwidth high enough to
fulfil all requests for data transfer in a minimum amount of time. However, right
now and probably also in the next decade, no one assumes that such a network
can become reality with current technologies, even when broadly deploying DWDM.

As long as there are any bottlenecks within the Internet, packets may be
dropped or may experience a longer delay than usual, i.e., than under light load.
Therefore no hard guarantees can be given regarding the characteristics of a data
transmission; even statistical guarantees suffer from high variances. On the other
hand, the Internet must support quality of service to become the common
network for QoS-sensitive multimedia applications with specific requirements
regarding jitter or bandwidth.

Nevertheless, a network supporting quality of service can only work efficiently
if users do not always demand the best quality of service for themselves. The
only way to assure this is to put the Internet on a solid economic base. Otherwise
every user will try to transmit data with the highest possible quality without
caring about other users, and in the end nothing will change compared to the
situation today. Thus, there must be certain incentives for users not always to
choose the best quality; this can be realized by introducing pricing schemes. This
way users will only request as much quality of service as they really need, because
they have to pay more for more QoS.

This paper is structured as follows. The following section discusses the Dif- 
ferentiated Services approach as the currently most promising architecture for 
QoS-provisioning in the Internet. The third section presents several requirements 
an open framework for pricing must fulfil to be flexible and efficient. After that, 
a description of the proposed framework follows. A brief overview of related work 
and a summary conclude the paper. 

2 Differentiated Services: An Open Architecture for QoS 
Provisioning in the Future Internet 

RFC 2475 defines an architecture for scalable service provisioning in the future
Internet called Differentiated Services (DiffServ). Different service classes can
be constructed by combining certain forwarding behaviors (Per-Hop Behaviors,
PHBs) in the routers. Each Internet Service Provider is responsible for deciding
which set of PHBs is offered to the customers within the Differentiated
Services domains belonging to the ISP. This is intended: the market, and not the
IETF, decides which Per-Hop Behaviors customers want. So the market for better
services and the competition between the ISPs play the main part in defining the
details and implementations of service provisioning in the future Internet.
Differentiated Services only defines the framework. The IETF strictly avoids making
any regulations for the ISPs. Currently, a framework for accounting management
is being discussed in the IETF AAA working group ([1]).

Differentiated Services is currently the main architecture for QoS provisioning
in the future Internet. Therefore, this paper uses DiffServ as the basis for
presenting an open framework for intra- and interdomain pricing. Before going into
the details of the proposed open pricing framework, one basic component of the
DiffServ architecture must be considered:

In Differentiated Services domains a management entity will be needed. Some
suggestions have been made, e.g., the Bandwidth Broker (RFC 2638), the Domain
Manager [2], etc. They all have in common that they use a client-server model for
setting up a reservation within a DiffServ domain. To make a reservation, an end
system has to contact its management entity, which checks the reservation for
the local domain. If it is acceptable, the management entities of the consecutive
domains will be called recursively. As will be shown, the management entity
can be used as a platform for a pricing framework.
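The recursive reservation setup can be illustrated with a short Python sketch. It does not reproduce the Bandwidth Broker of RFC 2638 or the Domain Manager of [2]: the class ManagementEntity, its capacity model (a single bandwidth budget per domain) and the fixed chain of next-hop domains are assumptions made only for this example.

```python
class ManagementEntity:
    """Hypothetical per-domain management entity with one bandwidth budget."""

    def __init__(self, domain, capacity_mbps, next_entity=None):
        self.domain = domain
        self.free_mbps = capacity_mbps
        self.next_entity = next_entity  # entity of the next domain on the path

    def reserve(self, mbps):
        """Check the local domain first, then ask the following domains."""
        if mbps > self.free_mbps:
            return False  # rejected locally, nothing is forwarded
        if self.next_entity is not None and not self.next_entity.reserve(mbps):
            return False  # rejected further along the path
        # Commit locally only after all consecutive domains have accepted.
        self.free_mbps -= mbps
        return True
```

An end system would call reserve() on the entity of its own domain only. Committing the local resources after the downstream call returns keeps a rejected request from leaving partial reservations in the upstream domains.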

3 Requirements for an Open Framework for Pricing 

The process of setting a price is called pricing. The open issue in pricing is
how much and how to pay for data transmissions using value-added services.
However, data transmissions cannot be priced without taking into account other
aspects of the economic structure of the Internet. Before setting prices, it must
be clear where, when and on what they are set; otherwise problems will
arise. For example, a pricing scheme might require a huge amount of accounting
information (defined in this context as a record of the used resources) and might
cause an extensive overhead for exchanging this information. In addition, the
relation between used resources and prices must be defined (this is called charging
according to [3]). In the following, the requirements for such an underlying system
are stated, as well as the requirements for an adequate pricing scheme.



3.1 Charging and Accounting System 

— Flexibility: 

Currently, it remains unclear how the future Internet will support quality 
of service. As a result of this, a charging and accounting system should not 
depend too much on a certain quality of service architecture. It should rather 
be designed as an open framework with fixed interfaces, that allows for an 
easy adaptation to different architectures. 

Furthermore, the system requires not only flexibility concerning the quality 
of service architecture but also regarding the bandwidth of the transmission 
links. The bandwidth of Internet links is growing rapidly and a charging 
and accounting system has to be scalable enough to handle the tremendous 
growth-rates. 

An open pricing framework must be able to use different pricing schemes. 
Over the years it may be necessary to change the pricing schemes. These 
changes should not influence the underlying charging and accounting system. 
It should also be possible to use different pricing schemes at the same time. 
A pricing scheme that is effective in one part of the Internet might not be 
effective in another part. Different countries and different providers have 
different requirements for pricing schemes, too. 

— Low overhead:

The overhead for charging and accounting must be as low as possible. In
particular, it should not reach the complexity of the accounting system in today’s
telephone networks. If the costs of introducing and maintaining a charging
and accounting system exceed its benefits, it will not be accepted.

— Migration: 

To make the introduction of such a system as smooth as possible it should 
be possible to introduce it step by step. Some autonomous systems in the 
Internet should be able to use the new system and should still be able to 
communicate with domains that do not support it. 

— Functionality:

Charging and accounting of transmitted data must always be done exactly
and correctly. The resource usage and the guaranteed quality of service must
be measured in detail and should not put anyone at a disadvantage. It is important
that the accounting is done not only at the access networks of the Internet
but also in its interior transit domains. This is the only way to guarantee
sound economic behavior in the Internet.

A key factor is a solution to the problem of direction, which will be explained
in detail in section 4. The question is whether a user should pay for data she
receives or for data she sends. This is a real problem for network providers.
A solution to it would give them a big incentive to introduce a charging and
accounting system.

3.2 Pricing Schemes

— Flexibility:

Pricing schemes must also be flexible. They have to adapt automatically 
to changing situations. How they do this is up to the network providers. 
They need mechanisms to influence the pricing scheme so that the automatic 
pricing is done according to their objectives. 

— Incentives:

The most important function of pricing is to give the right incentives. Users
are influenced by prices and change their behavior accordingly. Therefore
pricing has to be done in such a way that the resulting behavior of users leads to
the fulfillment of several objectives.

The first objective that comes to mind is the avoidance of congestion in the
network. In order to achieve this, prices must be set in accordance with the
load of the network and the load of services. If prices are increased, demand
will decrease, and vice versa. If this is done in a sensible way, the network
capacity can be distributed more effectively.

It is interesting to note that such pricing can also lead to a sensible extension 
of network resources. Critical locations in the network topology yield higher 
profits. The network providers will use these profits to extend their resources 
so that they might gain even higher profits. 

— Setting a price: 

For the sake of the users and for the sake of a healthy and competitive market 
prices should be fair. This means nothing else than that the prices should be 
set according to the actual costs. This especially means that prices must be 
set and paid where their corresponding costs originate. When transmitting 
data over networks operated by different network providers each provider 
must be paid according to its costs. 

— Low effort:

When implementing a pricing scheme it is important to make it simple to use
and to understand. It might be straightforward to create very complex and
exact technical solutions, but these will not necessarily be the best solutions.
It must not be forgotten that every pricing scheme must still be managed by
network owners. Errors in setting prices that are due to the complexity
of a pricing scheme can have fatal effects.

Installing and maintaining a pricing scheme should require as little technical
and financial effort as possible. The benefit of a pricing scheme must always
be higher than the effort needed for realizing it.

4 Modelling

The following sections describe our model of a charging and accounting system
and a first implementation architecture. To be able to support different QoS
architectures we concentrate on properties of the Internet that are common to
all approaches. Our goal is to first define a general and simple model which
can be easily adapted to different QoS architectures and then concentrate on
implementing it in a Differentiated Services environment.

4.1 Charging and Accounting System 

Domain-Centered Structure of the Internet A closer look at the Internet
reveals that it is not a monolithic block but is structured into domains and
autonomous systems. These units are physically connected, but the main
relationship between them can be described in terms of services provided
by service providers. A data transmission from a sender to a recipient can be
described as a recursive service usage. The sender passes its data to the service
provider it is connected to and relies on its forwarding capabilities. The
service provider passes the data on to another service provider, who again passes
it on recursively to successive service providers until the data reaches the service
provider the recipient is connected to and the data gets delivered.



Problem of Direction The domain-centered structure of the Internet causes
a basic problem of charging and accounting which might be called the Problem
of Direction. It is a typical problem between service providers (cf. also [4]): it is
not obvious whether provider 1 has to pay for the traffic it sends to provider
2 or for the traffic it receives. If the clients of provider 1 send much
data (e.g., by transferring files to other locations via FTP), provider 1 should
pay provider 2 for transporting the traffic. On the other hand, if the clients of
provider 1 mostly request data (e.g., by performing file downloads or browsing the
web, where a small HTTP request can cause the sending of large amounts of data
from a server), provider 1 should pay for the data it receives. Thus, the problem
arises that not only the destination of a data packet but also the context has to
be taken into account, resulting in a large overhead. In order to find a solution
to this problem, a look at other domain-structured networks might help.

Example for Recursive Service Usage: Postal Service The Internet is
a network for transporting digital data. Networks for transporting other goods
have existed for many years. The worldwide postal service represents
such a network. It also transports packets of different sizes and consists of
subnetworks (when focussing on national postal services and ignoring global
players). The sender hands his packet to the national postal service and pays
for transportation to a destination abroad. The national postal service has
contracts with other postal services and passes the packets to one of these, thus
realizing a recursive service usage similar to that described above.

An interesting basic paradigm of the postal service concerning charging and
accounting is the fact that normally the sender has to pay for the transport.
Transferring this to the Internet would mean that each participant of the
Internet must pay for the data she sends. This seems to be a problem when data
transmissions take place on request, as is the case in a WWW session. However,
this case also exists in the postal service network. A buyer may order goods
from a seller. The seller will then send the goods to the buyer and pay the postal
service for the transport. Afterwards, the buyer will pay the seller both for the
goods and for the transport. So payment negotiation between sender and receiver
is done on a higher level, which has also been proposed by [5]. The pricing
framework just has to provide the technical basis for determining the transport costs
of the sender. It is up to the sender how to bill the transport
costs; in most cases this will be combined with the already existing billing
of the costs of the content the server provided.



Conclusions for an Open Pricing Framework The basic paradigm of the
postal service, that the sender always pays for its data transmission, can be
easily transferred to the Internet. Thus the problem of direction can be solved in
a simple way. This strategy is plausible, taking into account that one cannot
prevent someone else from sending data, whereas one can always refrain from
sending data oneself.

As the sender does not want to care about the transportation of its data
once it has been passed to the service provider, the sender does not want to pay
all service providers on the way to the destination for data transport, but only
the one it is directly connected with. This single relationship can be set up
using the model of recursive service usage again: the sender pays its directly
connected provider for the service of data transportation to any destination.
In analogy to the described recursive service usage (provider 1 passes the data
to provider 2 and so on), the providers have to pay each other recursively for
the service of data transportation. The recursive service usage thus divides
the price the sender pays among the providers.
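
The recursive settlement described above can be illustrated with a minimal
sketch. The class name, the per-domain prices and the three-provider chain are
assumptions made only for this example; the framework itself does not prescribe
how a provider computes its own share.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Provider:
    """A provider that charges for transport and hands data to the next provider."""
    name: str
    own_price: float                        # hypothetical price for crossing its own domain
    next_hop: Optional["Provider"] = None   # provider the data is passed to, if any
    revenue: float = 0.0                    # what remains after paying the successor

    def quote(self) -> float:
        """Price asked from the predecessor: own share plus the downstream price."""
        downstream = self.next_hop.quote() if self.next_hop else 0.0
        return self.own_price + downstream

    def carry(self, payment: float) -> None:
        """Accept the payment, pay the successor recursively, keep the rest."""
        if self.next_hop:
            downstream = self.next_hop.quote()
            self.next_hop.carry(downstream)
            self.revenue += payment - downstream
        else:
            self.revenue += payment


p3 = Provider("provider 3", own_price=1.0)
p2 = Provider("provider 2", own_price=2.0, next_hop=p3)
p1 = Provider("provider 1", own_price=3.0, next_hop=p2)

price = p1.quote()   # the only price the sender ever sees
p1.carry(price)      # the sender pays provider 1 and nobody else
```

In this sketch the sender has a single business relationship (with provider 1),
while the amount paid is split along the path without any end-to-end protocol
exchange.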



Price Information In order to implement the model of recursive service usage 
in the area of pricing, network providers must know in advance what the trans- 
port of data to a destination will cost. When using services that need an explicit 
connection setup the settling of prices can easily be done. This has already been 
shown in [3]. However, at least the Differentiated Services architecture also sup- 
ports services without such explicit connection setup. In order to have the current 
price information for these services it must be distributed throughout the Inter- 
net. This can be done in analogy to existing routing protocols like OSPF or BGP, 
as both protocols deal with the distribution of information between Internet do- 
mains. We propose a Border Pricing Protocol (BPP) enabling the spreading of 
price information in the Internet. Price information can be transmitted upon 
request, as a flooding of price information might cause a lot of overhead traffic. 
Entities implementing the BPP set up a price information data base for
destinations and perform an aging of the entries. When an entry changes, the BPP
can be used to transmit the change to entities that previously requested price
information via a notification request.

We suggest adapting existing procedures for distributing and utilizing rout- 
ing information in order to additionally distribute and utilize price information. 
Routing protocols have been extensively investigated and it is very sensible to 
make use of them instead of developing new protocols for distributing informa- 
tion similar to routing information with respect to where and when it is required. 
Routing is also a common factor of all QoS approaches for the Internet and thus 
this strategy makes it possible to adapt the distribution of price information to 
different QoS approaches. It is true that in practice only simple routing
protocols or even static routing are often used, since this is more practical.
However, the same pragmatism carries over to the distribution of price
information: every provider can choose how dynamic he wants his price
information to be. Short-term pricing due to congestion would be possible by
adapting, e.g., OSPF or BGP, and long-term pricing influenced only by marketing
and other economic decisions would be possible by setting static prices.



Relationship to the Differentiated Services Architecture As the Differ- 
entiated Services architecture is also centered around domains it is reasonable 
to integrate the entity executing the BPP within the management entity of 
the DiffServ domain. The domain manager presented in [2], which already sup- 
ports reservation setup and resource management, can easily be extended to 
offer charging and accounting of the domain, too. Thus there is one central en- 
tity in each Differentiated Service Domain which is responsible for managing the 
routers, especially the routing, pricing and charging. Figure 1 shows the relation- 
ship between different DiffServ domains. The integration of pricing mechanisms 
does not produce much overhead, as the presented framework does not require 
the collection of a large amount of accounting information. Service providers can 
store an accounting record for every data transmission but they can also simply 
store a single "money counter" for each service class and each other network
they are connected to. 
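
The low-overhead accounting variant can be sketched as follows. The domain
names, class names and the integer currency units are hypothetical; the point
is only that one counter per neighbouring network and service class suffices.

```python
from collections import defaultdict

# One "money counter" per (neighbouring network, service class) pair,
# kept in hypothetical smallest currency units.
money_counters = defaultdict(int)


def charge(neighbour: str, service_class: str, amount: int) -> None:
    """Add the price of one transmission to the matching counter."""
    money_counters[(neighbour, service_class)] += amount


charge("domain-B", "premium", 50)
charge("domain-B", "premium", 25)
charge("domain-C", "best-effort", 10)
```

No per-transmission accounting record is stored here; settlement with a
neighbour would simply read and reset its counters.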

As with DiffServ, in the beginning not all service providers will adopt the 
pricing framework and will support the Border Pricing Protocol. As the recursive 
service usage does not require a protocol communication from end to end this is 
no obstacle. In the case of a service provider not supporting BPP the financial 
settlement with this provider can be done the same way as before, in most cases 
by means of long-term contracts. 

As stated earlier, there is no need for a pricing framework to define the
way charging and accounting inside a domain are realized. Different providers
could implement different internal mechanisms or Interior Pricing Protocols
(IPP); protocol standardization is only needed for the communication between
providers. The methods of an Interior Pricing Protocol are not the focus of this
paper, as charging and accounting inside a domain will remain an area of
provider-specific interest and will be dominated by many non-technical aspects.
Nevertheless, combining BPP and a dedicated IPP inside the management entity of
the Differentiated Services domain will provide the highest benefits.



Fig. 1. Charging between Different Services Domains



Charging the End User Charging the end user can be done in many different 
ways. The most complex way is to transmit all price information directly to the 
user. While this is the most precise way it is probably not the best. The user 
would have to pay close attention to the current price of data transmissions. 
She could be supported by cost tools and the current price information could be 
displayed in the status line of web browsers, etc. Nevertheless, a far simpler
way to charge the end user is to stop the distribution of price information at
the user’s provider. The provider and the user can agree on a certain charging
scheme, for example a service-specific flat rate, and the provider carries the
financial risk. An end user, of course, will select a provider whose charging scheme 
fits her needs best. 

4.2 Pricing Scheme 

Pricing is one instrument of the so-called 4 Ps (Product, Price, Place,
Promotion). These four instruments are important parts of every corporation’s
strategy. According to the founder of the theory of the 4 Ps, only their combined
usage can lead to success in a market. Therefore it is very unlikely that
corporations will accept automatic pricing schemes on which they have little or
no influence. As already argued for charging and accounting, and even more so for
pricing, there is no need for a standardized scheme in all Internet domains. The
ideal free service market already guarantees that the incentive requirements and
the cost orientation for pricing schemes will be fulfilled. Thus providers choose
their own pricing schemes, which can, e.g., be integrated in the domain managers
of Differentiated Services Domains.

4.3 Implementation

In order to demonstrate the usability of the presented approach it was merged 
with some of our earlier work. In [6] we have presented a testbed for Differ- 
entiated Services. It was enhanced to demonstrate the model presented here 
by adding the functionalities of collecting accounting information and charging 
transmitted data. 




Fig. 2. Processing of price information and normal data packets 



Figure 2 shows the flow chart of the implementation of pricing mechanisms 
for best effort traffic. Incoming packets have to be classified first according to 
their type. BPP protocol units containing price information or price information 
requests have to be separated from packets which simply have to be forwarded. 
Thus there are three cases:

- New price information is received and added to the price information data
  base. If there were notification requests, the pricing information is
  forwarded to the requesters. Thus, broadcasting or flooding of price
  information is avoided.

- A price information request is received. If the pricing data base contains an
  appropriate entry, the response is sent back; otherwise, the pricing request
  is forwarded along the route to the requested destination.

- For each incoming data packet, the price information data base is searched
  for an entry for the destination of the packet. If the price to the
  destination is known, the packet is charged. Otherwise, a pricing request is
  issued along the path to the destination.

An age counter is associated with each entry in the price information data base.
All age counters are decreased periodically and reset if the entry is used. Entries
whose counters reach zero are deleted. Thus the price information data base does
not grow too large.
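
The three cases and the aging of entries can be summarised in a short sketch.
It is not the testbed code: the message tuples, the `next-hop` placeholder, the
value of `MAX_AGE` and the assumption that every price request also acts as a
notification request are all chosen for illustration only.

```python
from dataclasses import dataclass, field

MAX_AGE = 3  # hypothetical number of aging periods an unused entry survives


@dataclass
class PriceEntry:
    price: int            # price per packet in hypothetical currency units
    age: int = MAX_AGE


@dataclass
class BppEntity:
    prices: dict = field(default_factory=dict)       # destination -> PriceEntry
    subscribers: dict = field(default_factory=dict)  # destination -> set of requesters
    outbox: list = field(default_factory=list)       # messages this entity would send
    charged: int = 0

    def on_price_info(self, dest, price):
        """Case 1: store new price information and notify earlier requesters."""
        self.prices[dest] = PriceEntry(price)
        for requester in sorted(self.subscribers.get(dest, ())):
            self.outbox.append(("price_info", requester, dest, price))

    def on_price_request(self, requester, dest):
        """Case 2: answer from the data base or forward the request towards dest."""
        self.subscribers.setdefault(dest, set()).add(requester)
        entry = self.prices.get(dest)
        if entry is not None:
            self.outbox.append(("price_info", requester, dest, entry.price))
        else:
            self.outbox.append(("price_request", "next-hop", dest))

    def on_data_packet(self, dest):
        """Case 3: charge the packet if the price is known, otherwise ask for it."""
        entry = self.prices.get(dest)
        if entry is not None:
            entry.age = MAX_AGE          # the entry was used, so reset its counter
            self.charged += entry.price
        else:
            self.outbox.append(("price_request", "next-hop", dest))

    def age_entries(self):
        """Periodic aging: decrement every counter and delete expired entries."""
        for dest in list(self.prices):
            self.prices[dest].age -= 1
            if self.prices[dest].age <= 0:
                del self.prices[dest]


entity = BppEntity()
entity.on_data_packet("net-X")                # price unknown: a request is issued
entity.on_price_request("domain-A", "net-X")  # still unknown: request is forwarded
entity.on_price_info("net-X", 5)              # stored, and domain-A is notified
entity.on_data_packet("net-X")                # price known: packet is charged
for _ in range(MAX_AGE):
    entity.age_entries()                      # the unused entry finally expires
```

Because price information only travels to entities that asked for it, the
sketch never broadcasts, which mirrors the motivation given for the first case.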

Currently we are working on the detailed design and implementation of the
aforementioned domain managers. During this work we will integrate the
management of routing and price information. We will also integrate charging for
services with connection setup so that finally all sorts of traffic can be charged.
We will then evaluate the system. 



5 Related Work 

Currently, there are several models for pricing in the Internet. The following will 
give a brief overview of six different models. 



Smart Market [7]: The model tries to optimize the benefits for all users by 
applying a responsive pricing mechanism. Each user has to assign a certain bid 
(in terms of money, credits etc.) to each data packet sent into the network. 
The first gateway collects data packets within a certain interval and sorts them 
according to their bids. Then, the gateway transfers packets starting with the 
packets carrying the highest bid until it reaches the capacity of the outgoing links 
within this interval. The highest bid of a packet which has not been forwarded 
determines the cost that will be charged for each forwarded packet from the 
senders. 
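
The auction at the first gateway can be sketched as follows. As simplifying
assumptions, the capacity is counted in packets per interval (i.e., packets are
treated as equally sized) and the price is taken to be zero when no packet has
to be rejected; the function name is hypothetical.

```python
def smart_market(bids, capacity):
    """Return (indices of forwarded packets, price charged per forwarded packet)."""
    # Sort packet indices by bid, highest first.
    order = sorted(range(len(bids)), key=lambda i: bids[i], reverse=True)
    forwarded = order[:capacity]
    rejected = order[capacity:]
    # The highest bid among the packets that were not forwarded sets the price.
    # Assumption: without any rejected packet there is no congestion charge.
    price = bids[rejected[0]] if rejected else 0
    return forwarded, price


# Five packets compete for three slots in one interval.
forwarded, price = smart_market([5, 1, 8, 3, 7], capacity=3)
```

Every forwarded packet pays the same price, which is never higher than its own
bid.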

The model is based on a very simple view of a network and considers only 
the network access. Furthermore, the calculation of prices is not transparent for 
users, especially if the model is applied to different service classes. Finally, the 
model assumes the auction of similar goods which is not the case for data packets 
as their size may vary. 



Upper Bound [8]: This model, which is easy to understand for a user, is based 
on a traffic contract which comprises two parameters, a and b. The provider may 
offer several such pairs and calculate the actual cost using the simple formula 
cost = a*d + b*v, with d being the duration of a connection and v being the 
volume of transmitted data. In order to avoid unfairness due to bursty traffic, 
the effective bandwidth of a user is calculated. The effective bandwidth is based 
on a leaky bucket model of the traffic. The user has to choose a traffic contract 
(a,b) which gives an upper bound for the effective bandwidth, i.e., a user has to 
estimate the average expected traffic. 
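
As a small worked example of the formula cost = a*d + b*v, the following sketch
evaluates two offered contracts for the same connection. The (a, b) pairs and
the units of duration and volume are made up for illustration.

```python
def connection_cost(a, b, duration, volume):
    """cost = a*d + b*v for a traffic contract (a, b)."""
    return a * duration + b * volume


contracts = [(2, 1), (1, 3)]   # hypothetical (a, b) pairs offered by the provider
duration, volume = 10, 4       # one connection: d = 10, v = 4

costs = [connection_cost(a, b, duration, volume) for a, b in contracts]
```

For this long but low-volume connection the second contract, with the smaller
time coefficient, is cheaper; for a short, high-volume transfer the first one
would be.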

Again, only the access to a network is considered. 



Arrow [3,9]: This model is mainly based on traffic flows as considered in the
Integrated Services architecture and extends the signaling protocol RSVP with
charging information. Another approach utilizing RSVP is described in [10]. At
flow setup the cost of each flow has to be negotiated. The cost of a flow is
defined as the sum of the costs at each router. The model offers four different
service classes: deterministic guaranteed, statistic guaranteed, best-effort high
priority, and best-effort low priority. The network provider assigns a base price
to each service class. This price may vary over time and depends on long-term
predictions. From this base price the operator calculates the price of a flow by
multiplying the base price by the data volume, the duration of the connection
(flow), and a penalty function. The penalty function reflects the violation of a
traffic contract.
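
The price calculation described for this model can be sketched as below. The
base prices and, in particular, the shape of the penalty function are
assumptions for illustration; the description above only states that the
penalty reflects contract violations.

```python
# Hypothetical base prices per service class.
BASE_PRICE = {
    "deterministic guaranteed": 8,
    "statistic guaranteed": 4,
    "best-effort high priority": 2,
    "best-effort low priority": 1,
}


def penalty(sent_rate, contracted_rate):
    """Assumed penalty: 1 while within the contract, growing with the excess."""
    return max(1.0, sent_rate / contracted_rate)


def router_price(service_class, volume, duration, sent_rate, contracted_rate):
    """Base price multiplied by volume, duration and the penalty factor."""
    return (BASE_PRICE[service_class] * volume * duration
            * penalty(sent_rate, contracted_rate))


def flow_cost(per_router_prices):
    """The cost of a flow is the sum of the costs at each router."""
    return sum(per_router_prices)


violating = router_price("best-effort high priority", 10, 2,
                         sent_rate=3, contracted_rate=2)
conforming = router_price("best-effort high priority", 10, 2,
                          sent_rate=1, contracted_rate=2)
total = flow_cost([violating, conforming])
```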

The approach is very pragmatic and technically oriented. However, due to the
scalability problems of RSVP in large-scale networks it is not clear how these
extensions can be applied on top of it.



Provider-Oriented [11]: As the name already implies, this approach tries to
maximize the benefits of a network provider. The model is based on two axioms:
pricing must be linear (i.e., double resource usage implies double costs) and a
single bit always has the same price, regardless of the service class. This model,
too, is based on the Integrated Services architecture. Using a leaky bucket to 
describe the service classes guaranteed, guaranteed rate, and controlled load, the 
model establishes a system of linear equations that has to be solved to optimize 
provider benefits. 

However, even the two axioms on which the model is based do not seem to be
realistic. It may be beneficial for a provider to sell larger chunks of bandwidth to
smaller resellers. Furthermore, equal costs per bit independent of the service class
lead to the use of the highest class by all users, independent of their real needs.



Paris Metro [12]: This approach is based on a very simple self-regulating
mechanism. Bandwidth is split into several channels, e.g., 3 or 4. The transmission
over each channel has a certain price, with one channel having the highest price
(1st class) and one the lowest price (e.g., 4th class). If the network is only lightly
loaded, all traffic will use the lowest class, which is the cheapest. At higher load,
some traffic will choose more expensive classes because of the advantage of less
load within these classes. Each router implements a separate queue for each class,
and each packet is tagged with its class.
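
The self-regulating behaviour can be illustrated with a toy sketch. The router
class, the prices and the user's decision rule (take the cheapest class whose
queue is not longer than a tolerated length) are assumptions for illustration;
the approach itself leaves the choice entirely to the user.

```python
from collections import deque


class ParisMetroRouter:
    """One queue per class; packets are tagged with the class they travel in."""

    def __init__(self, prices):
        self.prices = prices                        # class -> price per packet
        self.queues = {c: deque() for c in prices}  # class -> queued packets

    def enqueue(self, packet, cls):
        self.queues[cls].append((cls, packet))      # tag the packet with its class

    def load(self, cls):
        return len(self.queues[cls])


def choose_class(router, max_tolerated_load):
    """Pick the cheapest class whose current queue is short enough."""
    for cls in sorted(router.prices, key=router.prices.get):
        if router.load(cls) <= max_tolerated_load:
            return cls
    # Every class is too loaded: fall back to the most expensive one.
    return max(router.prices, key=router.prices.get)


router = ParisMetroRouter({1: 4, 2: 3, 3: 2, 4: 1})   # class 1 is the most expensive
for i in range(5):
    router.enqueue(f"pkt{i}", 4)                       # the cheapest class fills up

impatient = choose_class(router, max_tolerated_load=2)
patient = choose_class(router, max_tolerated_load=10)
```

Note that the sketch assumes the user can observe the queue length of each
class, which is an idealisation.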

This approach is very simple to understand and easy to implement. However, 
it does not give any bandwidth guarantees. Furthermore, it may be difficult for
a user to determine whether a class has enough capacity and thus whether it is
worth switching classes.

6 Conclusion 

This paper outlines an open and flexible framework for charging and account- 
ing. The framework only assumes certain entities within administrative domains 
that can exchange charging and accounting information but does not impose 
any particular pricing scheme on service providers. Similar to routing protocols,
the architecture defines a BPP (Border Pricing Protocol) to exchange price
information between administrative domains, but leaves the exact definition of
IPPs (Interior Pricing Protocols) to the providers. The entities handling
charging and accounting could be co-located with, e.g., bandwidth brokers, as
foreseen in the Differentiated Services architecture. However, the architecture
does not assume a certain QoS architecture. To keep the overall overhead of
exchanging price information low, this information can be exchanged together
with routing information. Our system is able to charge best effort transmissions
across the Internet and not only the access to the Internet. This is basically
done by recursively forwarding price information from one provider to another
following the path of data transmission.

We think that an open market will decide which pricing scheme fits which
purpose best and thus we do not enforce a certain scheme but offer an
architecture for implementing pricing schemes. Currently, we are testing our
architecture in a small Differentiated Services testbed. The next step is a larger deployment
with more test partners and the implementation of different pricing schemes.



Abstract. Mobile telephone communications and the Internet are
converging and may eventually operate on a common technical
platform, using TCP/IP networks as the main backbone medium.
Mobile telephones are converging to Internet terminals, allowing the
user access to email, Web browsing and all the other Internet services
currently available from a desktop computer environment. In order to
provide improved infrastructure for Global System for Mobile (GSM)
based Internet services using 2nd and 3rd generation (2G and 3G)
technology, the mobile network providers have the requirement to generate
revenue for the services they provide. To do this the mobile network providers
first need to capture the charging and billing data from the network. This
paper describes the evolution of the GSM telephone networks and
future mobile Internet services, via the General Packet Radio Service
(GPRS) and Universal Mobile Telecommunication System (UMTS).
The methods for collecting the charging and billing information and
charging models for processing this data into subscriber bills are
discussed. A selection of proposed Internet charging models is applied
to the mobile network market and their relative suitability is examined.

Keywords: 1G, 2G, 3G, GSM, GPRS, UMTS, Charging, Billing, Mobile, Internet.



1 Introduction 

The Global System for Mobile (GSM) was first introduced in 1992 with
approximately 23 million subscribers, rising to over 200 million in 1999 on over 300
GSM networks [1]. The aim was to provide a global mobile telephone network that
could be implemented using standard building blocks not tied to specific hardware
vendors. The uptake of GSM by subscribers is far higher than any industry
predictions and typifies the 1990s and the increasing need for personal mobility. The
1st generation (1G) GSM mobile networks provide subscribers with high quality voice
communications and low bandwidth data connections for FAX, Short Message
Service (SMS) and full dial-in connection to the Internet for email and web browsing,
usually requiring a mobile computer or intelligent handset. The addition of overlay
communication protocols, such as Wireless Application Protocol (WAP) [2], allows
mobile handsets on 1G GSM networks to be used for secure connection applications
such as mobile banking and other transaction based services. International roaming
agreements between the numerous mobile network providers allow subscribers to be
reachable almost anywhere in the world where there is GSM coverage using the same
telephone number and handset. Satellite based services such as GlobalStar [3] and
ICO [4] allow GSM subscribers to further expand their network coverage and
availability using the same mobile communications infrastructure. The increasing use
of mobile telephones and devices for data communication drives the need from the
market for a fast, reliable and available infrastructure. GSM proposes to provide the
required infrastructure using 2nd and 3rd generation (2G and 3G) GSM, which
introduce new technology that allows increased data bandwidths and new data
services [1]. 2G GSM introduces the General Packet Radio Service (GPRS) and 3G
GSM introduces the Universal Mobile Telecommunication System (UMTS).

J. Crowcroft, J. Roberts, and M. Smirnov (Eds.): QofIS 2000, LNCS 1922, pp. 312-323, 2000.
© Springer-Verlag Berlin Heidelberg 2000

The introduction of 2G and 3G GSM technology brings convergence of GSM mobile
networks with the Internet. Packet switching [5] is being introduced as the switching
mechanism for data calls and Internet sessions, in contrast with the circuit
switching implementations currently used in 1G GSM and fixed line telephony
networks. With the 2G and 3G technologies, the same services available from the
desktop Internet today, including email, secure transactions and Web browsing,
become available on mobile devices, using the standard infrastructure of the Internet.
In order for the mobile networks to be able to offer these additional services to the 
customers there is a requirement for the recovery of the infrastructure investment cost. 
This is a prime justification and motivation for charging and billing for telephone 
network usage together with the need for generating commercial profits for telephone 
network shareholders and companies. Charging may also be used to provide 
congestion control in under-provisioned and over-subscribed networks. 

2G and 3G GSM networks present the operators with many charging and billing 
challenges. The experience gained with charging and billing with GPRS will prove 
valuable when UMTS is being rolled out in GSM networks. There are various 
proposed economic and technical models for charging and billing for Internet usage.
Most of these are equally suitable for charging and billing of mobile network traffic,
especially with 2G and 3G GSM systems. 



2 GSM Mobile Networks and the Future Internet 

1G GSM networks [1] provide high quality digital telephony with low bandwidth data
communications for FAX and SMS. GSM networks are typically multi-vendor and
consist of a layered architecture including the mobile handsets, the telephone network
and the subscriber invoices and bills. The Base Station and the Network Subsystems
are often referred to as the Operational Network (ON), which is usually physically
distributed around the area of coverage of the GSM network. The ON elements are
often sited remotely with wide area networking (WAN) connectivity to the rest of the
network to allow centralised remote administration of the network. The Base
Transceiver Stations (BTS) and the Base Station Controllers (BSC) provide the air
interface for GSM, which is then circuit switched [5] using the industry standard SS7
switching by the Mobile Switching Centers (MSCs) in the ON. Additional Gateway
MSCs allow switching to other mobile and fixed line telephone networks, providing
interconnection and roaming. Billing tickets for all calls made in the network are
produced on the MSCs, based on subscriber IDs in the network.

The Operational Support Systems (OSS) provide the interface to the customer
invoices and bills, and normally include systems for billing, subscriber
administration, GSM Subscriber Identification Module (SIM) chipcard production, 
fraud detection, voicemail and off-line data-mining and analysis systems. Most 
mobile networks de-couple the ON from the OSS using Mediation Systems or 
Mediation Devices (MD). These systems are used to collect billing data from the ON 
and also to manage the subscriber databases in the ON elements. 

The collection of the billing data is normally via high-speed communication links
using reliable data protocols such as File Transfer, Access and Management (FTAM) and
X.25. Once billing data is collected centrally it can be processed into subscriber
invoices and bills using dedicated billing systems and the mobile network's charging
tariffs. The billing data can also be further processed by additional data-mining
systems to detect subscribers' usage patterns and possible fraud, and for subscriber
profile surveying.

With 2G GSM the General Packet Radio Service (GPRS) [1,6,7] is introduced,
providing an overlay service for Internet access that shares the same air interface as
1G GSM. The design goal behind GPRS is to provide high-speed Internet data
communications for mobile devices and subscribers using the existing 1G GSM air
interface, thereby minimising the cost impact on the existing installed network
infrastructure. GPRS is implemented in an existing GSM network with the addition of
two new ON elements: the Serving GPRS Support Node (SGSN) and the Gateway
GPRS Support Node (GGSN).

Additional modifications to the existing BTS and BSC to include Packet Control
Units are also required so that the network is GPRS aware. The two new ON elements
provide the interface between the GSM air interface and the TCP/IP network used for
the GPRS specific traffic (i.e., Internet sessions used for email, HTTP, FTP, etc.). GPRS
has the advantages of digital telephony of GSM combined with increased bandwidth
over the air interface for data traffic. The GGSN and SGSN in the ON provide the
switching for the mobile data sessions and use packet switching [5]. GPRS data
sessions are routed by the MSCs as for 1G GSM, with the SGSN and GGSN routing
the Internet sessions to the TCP/IP network using packet switching [5]. Packet
switching makes full use of the available bandwidth of the underlying network, but
often has a reduced Quality of Service, and is suited to bursty network traffic
including Internet protocols such as HTTP, FTP and email, where guaranteed qualities of
service are not a top priority. In addition to introducing TCP/IP packet switching,
GPRS equipped mobile networks may roll out IPv6 [8] as the preferred IP protocol.
This will allow the large number of addressable network nodes that will be required
when there is a high saturation of mobile devices requiring Internet connectivity.

The GGSN and SGSN produce billing tickets and statistical data relating to Internet
traffic usage generated by GPRS calls and sessions.

2G and 3G GSM bring with them a new set of parameters to the challenge of billing and
charging subscribers for using the GSM mobile networks. Mobile network subscribers
are normally charged on a time and usage basis for the high quality telephony. With
2G and 3G GSM there are new possibilities to charge the subscribers for how much
data or bandwidth they use in the network, in addition to the amount of talk-time
consumed. This shares commonality with the possibilities currently being proposed
for Internet charging and billing. As in the Internet, the cost of packet counting may
exceed the value of the packets being counted. These new challenges
need to be met and addressed by the network operators.

3G GSM mobile networks arrive with the introduction of Universal Mobile 
Telecommunication System (UMTS). This will be based on the standard ON and OSS 
GSM architecture with the addition of UMTS specific Network Elements. It will build 
on the infrastructure installed for GPRS with a marked increase in maximum
bandwidth to 2 Mbit/s [1].

Supported applications for 2G and 3G GSM may involve Internet intensive activities 
such as web browsing and email communication, as well as traditional mobile 
telephony. Table 1 shows a comparison of the bandwidth and communication rates 
achievable with the different generations of GSM networks [1]. 



Table 1. GSM Architecture Generations

Generation  Year    Technology                Max. Data Bandwidth
1st         1992    Voice                     N/A
1st         1995    SMS & Mobile Data & FAX   9.6 Kb/sec, Internet via modem
2nd         2001    GPRS                      115 Kb/sec, direct Internet connection
3rd         2002/3  UMTS                      2 Mb/sec, direct Internet connection



The introduction of GPRS is considered a stepping stone to the promises and 
functionality of UMTS and high-speed access to Internet services and Networking. 
Many mobile network operators view the experience to be gained with GPRS and the 
associated billing issues essential for the implementation of systems for UMTS as it 
may be too late to learn when UMTS is implemented and available for the mass 
market. The systems and methods developed for GPRS charging and billing need to 
be compatible with the requirements of UMTS to ensure preservation of investment, 
infrastructure and knowledge. 



3 Infrastructure for Charging and Billing 

In order to charge for mobile telephony services the network operator has to first
capture the network usage of all of the network's users, including subscribers and
roaming subscribers. This usage data then needs to be processed and set against
the billing and charging models and tariffs in use.

Billing tickets need to be collected and processed centrally so that the subscriber bills 
can be produced. The collection of billing tickets is often done by a mediation system. 
These systems may also carry out vendor specific translations on the billing ticket 
formats so that multi-vendor ONs can be implemented, or to allow the native billing 
tickets to be used on commercial billing systems, or on other centralised OSS systems
used for data-mining. The heterogeneous nature of most mobile networks may be very
problematic, with many different file formats and standards being involved. Once all
the billing tickets have been collected and pre-processed into a standard format that
the billing and other OSS systems can understand, they may then be used to produce
the invoices and bills for the subscribers.
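
The role of the mediation system can be illustrated with a minimal sketch. The
two vendor ticket formats, the field names and the flat per-second tariff are
invented for this example and do not correspond to any real billing ticket
format.

```python
def normalise_ticket(vendor, raw):
    """Translate a vendor-specific billing ticket into one common record."""
    if vendor == "vendor_a":
        # Hypothetical format: the string "subscriber;seconds".
        subscriber, seconds = raw.split(";")
        return {"subscriber": subscriber, "duration_s": int(seconds)}
    if vendor == "vendor_b":
        # Hypothetical format: a mapping with the duration given in minutes.
        return {"subscriber": raw["id"], "duration_s": raw["minutes"] * 60}
    raise ValueError(f"unknown ticket format: {vendor}")


def bill(tickets, tariff_per_second):
    """Sum the charge per subscriber over already normalised tickets."""
    totals = {}
    for ticket in tickets:
        charge = ticket["duration_s"] * tariff_per_second
        totals[ticket["subscriber"]] = totals.get(ticket["subscriber"], 0) + charge
    return totals


tickets = [
    normalise_ticket("vendor_a", "sub-1;120"),
    normalise_ticket("vendor_b", {"id": "sub-1", "minutes": 2}),
    normalise_ticket("vendor_a", "sub-2;30"),
]
invoices = bill(tickets, tariff_per_second=2)
```

The billing step only ever sees the common record, so adding a further vendor
format affects the mediation function alone.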

The ETSI [1] standards recommend a Charging Gateway Function (CGF) to handle 
the billing record generation. Two billing records are generated, one by the GGSN for 
the Internet part and one by the SGSN for the mobility (radio) part. 

Current GSM billing systems have difficulties implementing charging for 1G GSM
non-voice services and it is unlikely that existing billing systems will be able to 
process the large number of new variables introduced with 2G and 3G networks. 

An added complication for GPRS charging is the overlap and convergence with the
Internet and the multitude of diverse systems connected to it. In addition to the
inter-charging between the mobile and fixed telephone networks, inter-charging between
the mobile networks and Internet providers will be required, and this will add to the
operational costs of running the 2G and 3G services in parallel to the 1G GSM
network. With the already proposed Internet charging models the inter-charging
between the mobile and Internet network providers has the potential to become very
complicated and may include requirements for additional billing and charging
systems for the required accounting. The addition of GPRS to the GSM mobile
network modifies the call flows for Internet packet data as in Figure 1 and includes
the required gateway to Internet services and external networks:




Fig. 1. GSM 2G GSM Call Flow with GPRS

Network operators may also have packet counting systems in the network that will
produce additional billing and charging information that may require processing by
the billing systems.

The GPRS related billing tickets may be of a different format to the ones produced by
the MSCs and may include data on the number of packets exchanged during GPRS
sessions. Extensions to the mediation methods may be implemented for the collection
and pre-processing of the GPRS related billing tickets that may then be fed to the
billing systems. For 3G there will be the addition of UMTS Mobile Switching
Centers (UMSC) for UMTS specific traffic.

Once the billing ticket information has been collected from the network the mobile 
network requires a billing and charging system to make sense of all the data and 
produce the invoices and bills for the subscribers, and also to produce the cross- 
charge data for partner network providers. The actual cost of providing and 
maintaining such a billing system may be anything up to 50% of the total 
infrastructure investment and annual turnover of the mobile network. The billing 
system therefore needs to be able to provide a good deal of added value to make the 
investment worthwhile. This provides a valid justification for simplifying the billing 
function and investigating the use of charging models based on fixed price 
subscriptions and bulk purchase of talk time and data bandwidth in the network. 

Most mobile network operators currently offer contract subscriptions, which include a 
line rental element plus a contract rate for telephony airtime, usually based on call 
duration. In addition to the contract subscriptions the network operators offer pre-paid 
contracts where the subscriber pre-pays for the airtime used in the network. 
From a commercial viewpoint pre-paid makes sense for the network operators, since 
they will receive payment from subscribers prior to the consumption of resources. 
This simplifies revenue collection, but with the downside of increased complexity in 
the ON to prevent subscribers from over-spending on their pre-paid subscription. 
With the addition of Internet access via GPRS and UMTS, existing mobile network 
subscriptions need to be extended to include charging for the Internet services used by 
the subscribers. Just how to charge for the Internet services offered to and used by the 
subscribers is the major challenge to the mobile network providers. 

In most commercial environments some kind of fraud is normally present. Mobile 
networks are no exception. The vast array of billing ticket information produced by 
the ON in mobile networks can be processed offline and used effectively for fraud 
detection. Again the infrastructure investments for such systems are high and their 
added value has to be proved. Fraud detection fits quite nicely into the billing and 
charging models and they often go hand in hand. An example of fraud in GSM 
networks is the running up of large bills on stolen mobile phones. This can be 
detected using the billing data and the mobile phone being used can be blocked in the 
network, but this incurs a high cost in the real-time monitoring of the network traffic data 
and the associated systems and staff. With the addition of 2G and 3G GSM the 
opportunity for fraud increases and the network operators need to be aware of the 
different kinds of fraud that are possible and may occur. 
4 Charging Models 

There are many charging models that have been proposed [9] for the current and 
future Internet as well as those traditionally employed by the mobile and fixed line 
telephone networks. Most, if not all, of the Internet charging models are equally 
applicable for use in the mobile networks, especially with the introduction of 2G and 
3G GSM systems. Below is a discussion of some of the proposed charging models, 
and how they can be adapted to the mobile network markets. 

Metered Charging 

This pricing model is already in use with many Internet service providers (ISPs) and 
European mobile and fixed line telephone networks. The model involves charging the 
subscriber for the connection to the service provider on a monthly basis and then 
charging for metered usage of the service. The usage is usually measured in units of 
time and there is often a free period of usage included with the monthly fee. 
Variations on this model include having scaled subscription charges that increase with 
the metered usage. 

The use of this model in 2G and 3G GSM networks may become commercially 
problematic since subscribers may leave GPRS sessions open endlessly without the 
handset being powered on. Metered charging based on time for such usage may prove 
prohibitive. However, if the usage is based on other session parameters, for example 
number of packets transmitted/received, then the commercial impact becomes less 
and the model may be usable in mobile networks for data. 
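
As a rough illustration of why time-based metering sits badly with always-open GPRS sessions, the sketch below bills the same month of usage under two metered tariffs, one counting connected minutes and one counting packets. The class name, prices, allowances and usage figures are all hypothetical and chosen only to make the contrast visible.

```python
from dataclasses import dataclass


@dataclass
class MeteredTariff:
    monthly_fee: float   # fixed connection charge per month
    free_units: float    # usage included in the monthly fee
    unit_price: float    # price per metered unit beyond the allowance

    def bill(self, used_units: float) -> float:
        """Monthly fee plus a charge for usage above the free allowance."""
        chargeable = max(0.0, used_units - self.free_units)
        return self.monthly_fee + chargeable * self.unit_price


# Hypothetical tariffs: one metered by connected minutes, one by packets.
by_time = MeteredTariff(monthly_fee=15.0, free_units=60.0, unit_price=0.10)
by_packets = MeteredTariff(monthly_fee=15.0, free_units=10_000.0, unit_price=0.0005)

# An always-open GPRS session: connected all month, but with little traffic.
minutes_connected = 30 * 24 * 60
packets_exchanged = 50_000

time_bill = by_time.bill(minutes_connected)      # dominated by connection time
packet_bill = by_packets.bill(packets_exchanged)  # reflects traffic actually sent
```

Under these assumed figures the time-metered bill is two orders of magnitude larger than the packet-metered one, which is the commercial problem the paragraph above describes.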

Fixed Price Charging 

This pricing model is similar to that used by some US fixed line telephone networks 
for local call charging. The network service provider sets a fixed rental charge for the 
telephone connection and all local calls are then free of charge with metered charging 
used for long-distance calls. 

The advantage of this charging model is that call data for local calls does not need to 
be collected and processed, providing a commercial saving for the network operator 
in the billing systems and mediation systems infrastructure. 

Disadvantages of this model include no added revenue for the service providers in 
times of above average usage on the network, and congestion may also become an 
issue if the network is under-provisioned for the number of possible subscribers at 
peak times. This provides a strong argument for using charging and billing to improve 
congestion control by dissuading subscribers from using the network through higher 
cost for the provided services. 

Packet Charging 

Packet Charging is specific to Packet Switching [5] networks and involves 
capturing and counting the packets exchanged in a session. This is a 
proposed method of metering Internet traffic and being able to cross-charge between 
networks as well as ISP and mobile subscribers. This model requires the 
implementation of packet counting systems in the network and complex billing 
systems that can process the packet data on a subscriber and customer basis. 

The advantage of this method of charging is that the absolute usage of the network 
and services can be metered, calculated and billed for very accurately, as long as the 
packet information can be captured efficiently. 

The major disadvantage of Packet Charging is that the cost of measuring the packets 
may be greater than their actual value, both from an infrastructure investment and 
additional network traffic viewpoint. This may lead to packet charging being used as 
a policing tool to ensure that network bandwidth is used efficiently and not 
over-consumed by the network subscribers, rather than as a direct charging model. 

Expected Capacity Charging 

This charging model [9] allows the service provider to identify the amount of network 
capacity that any subscriber receives under congested conditions, agreed on a usage 
profile basis, and charge the subscriber an agreed price for that level of service. The 
subscribers are charged for their expected capacity and not the peak capacity rate of 
the network. Charging involves using a filter at the user network interface to tag 
excess traffic; this traffic is then preferentially rejected in the network in the case of 
network congestion but is not charged for; charges are determined by the filter 
parameters. 

This model has the advantage that the price to the subscriber is fixed and predictable 
which in turn permits the network provider to budget correctly for network usage. The 
expected capacity model also gives the network provider a more stable model of the 
long-term capacity planning for the network. This model fits well with mobile 
networks and the administration of the agreed expected capacity would be done as 
part of the normal subscriber administration tasks. 

One disadvantage is that the network operator has to police the actual capacity of the 
network used by subscribers and act accordingly by limiting the subscriber's service to 
what has been purchased, or by invoicing the subscriber for the extra capacity used, 
on a metered tariff for example. 
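
The filter at the user network interface can be pictured as a token bucket. The sketch below is a minimal illustration under that assumption (the text above does not say which filter is used): packets that fit the agreed profile are tagged "in", excess packets are tagged "out" and would be the first to be dropped under congestion. The function name, rate and bucket size are hypothetical.

```python
def tag_traffic(packets, rate, bucket_size):
    """Token-bucket filter: label each packet 'in' or 'out' of profile.

    packets:     list of (arrival_time_s, size_bytes), sorted by time
    rate:        agreed expected capacity in bytes per second
    bucket_size: burst allowance in bytes
    """
    tokens = bucket_size
    last_time = 0.0
    tagged = []
    for arrival, size in packets:
        # Refill tokens for the elapsed time, capped at the bucket size.
        tokens = min(bucket_size, tokens + (arrival - last_time) * rate)
        last_time = arrival
        if size <= tokens:
            tokens -= size
            tagged.append((arrival, size, "in"))
        else:
            # Excess traffic: carried if possible, but not charged for.
            tagged.append((arrival, size, "out"))
    return tagged


# Hypothetical profile: 1000 bytes/s expected capacity, 1500 byte burst.
trace = [(0.0, 1000), (0.1, 1000), (1.0, 1000), (1.2, 1000)]
tags = [tag for _, _, tag in tag_traffic(trace, rate=1000, bucket_size=1500)]
```

In this reading the charge depends only on the two filter parameters (rate and bucket size), not on how many packets end up tagged "out".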

Edge Pricing 

Proposed in [10], this model charges for the usage at the edge of the network scope 
for the subscriber, rather than along the expected path of the source and destination of 
the calling session. The networks in turn then cross-charge each other for the usage at 
the network edges. Edge pricing refers to the capture of the local charging 
information. Once captured the information can be used for any kind of charging 
including metered, fixed or expected capacity, for example. Past research [13] has 
shown that much of the observed congestion on the Internet is at the edges of the 
individual networks that make up the Internet. The use of edge pricing may be 
effective as a policing method to monitor and alert the network operators to such 
congestion. 

This approach has the advantage that all session data can be captured locally and does 
not involve exchanging billing data with other networks and partners for subscriber 
billing, as for current roaming arrangements between mobile networks. 

A disadvantage with this model is the lack of visibility of the routing via external 
networks and the costs of that traffic to both networks. The cost of collection of the 
data may again also be an influencing factor in the selection of this method, as for 
Packet Charging above. The cost of collecting the edge usage information may be in 
excess of the value of the collected information. 

Paris-Metro Charging 

This charging model, proposed in [11], introduces the concept of travel class, as used 
on public transport systems, to network traffic and relies on providing differentiated 
levels of service based on customer usage pricing only. The scheme assumes that 
subscribers will assign a preferred travel class with an associated cost for their 
different network traffic. The class assigned may be simplified to first and second 
class, as used on the Paris Metro system that inspired this charging model. The choice 
of class may be made dynamic and the subscriber may also use the throughput of the 
network to determine which class to use for their required traffic. The network may 
become self-regulating at periods of high usage. When the network becomes 
congested and all the capacity in first class is filled, subscribers may downgrade to 
second class to improve their own network performance. 

This charging model may work well in GPRS and UMTS networks and allow 
subscribers to prioritise network traffic, for example business emails may be 
considered more important than personal email so the cost penalty for first class may 
be considered appropriate for business email. 

An advantage of this charging model is the flexibility given to the network 
subscribers and also the control they have over the cost of their network traffic. 

This model has the disadvantage of introducing mathematical complexity to the 
network's behaviour and a tariff class decision overhead to the network subscriber. 
For this scheme to work the class decision may need to be made automatic, and may 
involve extensions to network communication protocols and the application software 
handling the network traffic. This charging model also requires the network 
bandwidth to be segmented and therefore does not allow multiplexing. 
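
The self-regulating behaviour described for this model can be sketched with a toy class-selection rule. The sketch assumes that each class has its own fixed capacity shared equally among its active subscribers and that a subscriber picks the class with the best value-minus-price trade-off; the function name, capacities, prices and valuation are hypothetical.

```python
def choose_class(load, capacity, price, value_per_mbps):
    """Pick the class with the best net benefit for one more subscriber.

    load:           current number of active subscribers per class
    capacity:       capacity per class in Mbit/s
    price:          price per class
    value_per_mbps: how much this subscriber values 1 Mbit/s of throughput
    """
    best, best_benefit = None, float("-inf")
    for cls in capacity:
        throughput = capacity[cls] / (load[cls] + 1)  # equal share incl. newcomer
        benefit = value_per_mbps * throughput - price[cls]
        if benefit > best_benefit:
            best, best_benefit = cls, benefit
    return best


capacity = {"first": 10.0, "second": 10.0}
price = {"first": 2.0, "second": 1.0}

# First class lightly loaded: the premium buys better throughput.
quiet = choose_class({"first": 1, "second": 9}, capacity, price, value_per_mbps=1.0)
# First class congested: downgrading to second class performs better.
busy = choose_class({"first": 9, "second": 1}, capacity, price, value_per_mbps=1.0)
```

A rule of this kind would have to run automatically in the handset or application, which is the decision overhead noted above.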

Market Based Reservation Charging 

This charging model, discussed in [12] and usually attributed to Mackie-Mason, 
introduces the concept of a public auction of bandwidth or network resources. The 
network subscribers place monetary bids that will influence the quality of service they 
receive from their network-based applications. This model may be used in the mobile 
networks by having subscribers to the network maintain a preferences profile that 
details the subscriber's bids for the various services used, for example email, voice, 
http, ftp and SMS. The network provider may then use the subscriber's preference 
profile when routing the network traffic. In the case of GPRS networks the subscriber 
preference profiles may be administered via a WWW page and browser or possibly by 
SMS. 

Subscribers have the advantage that they can influence their quality of service from 
the mobile network by the value they attach to the services they require. 

As a disadvantage this charging model introduces some uncertainty to the subscribers 
with regard to the quality of service in the network. It may also allow some of the 
subscribers to gain unfair advantage when they have bid for certain services at the 
expense of other subscribers and network users. This charging model is widely agreed 
to be unimplementable for Internet networks, and possibly also for the mobile 
networks. 
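
To make the role of the preference profile concrete, the sketch below allocates a fixed amount of capacity in descending order of bid. It is only one simple way an auction could influence service quality and is not taken from [12]; the function name, the bid values and the capacity are hypothetical.

```python
def allocate_by_bid(bids, capacity):
    """Serve requests in descending order of bid until capacity runs out.

    bids: dict service -> (bid_per_unit, units_requested)
    Returns dict service -> units granted.
    """
    granted = {}
    remaining = capacity
    ranked = sorted(bids.items(), key=lambda item: item[1][0], reverse=True)
    for name, (_bid, units) in ranked:
        share = min(units, remaining)
        granted[name] = share
        remaining -= share
    return granted


# Hypothetical preference profile: bid per unit and requested units per service.
profile = {"email": (0.02, 40), "voice": (0.10, 50), "ftp": (0.01, 60)}
granted = allocate_by_bid(profile, capacity=100)
```

With these assumed bids the lowest bidder (ftp) receives only what is left over, which illustrates the uncertainty in service quality discussed above.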




Evolution of Charging and Billing Models 321 



Summary 

In Table 2 the cost of implementation covers the infrastructure capital investment in 
new equipment and software to enable the use of the charging model. The overhead 
on the network of the charging models includes, but is not limited to, the additional 
network traffic required to implement the model, and the addition of any new systems 
for data collection and processing over and above the standard GSM building blocks. 
Overhead to the subscribers includes the added complexity of tariffs and the maintenance 
of the subscriber's account to use the charging models efficiently and to avoid 
excessive charging by the mobile network provider. 



Table 2. Charging Model Comparison 

Charging Model                      Cost of          Network      Subscriber    Provision for QoS
                                    Implementation   Overhead     Overhead      Improvement
----------------------------------  ---------------  -----------  ------------  -----------------
Metered Charging                    High/Medium      Low          Low           No
Fixed Price Charging                Medium/Low       Low          Low           No
Packet Charging                     High             High         Low           No
Expected Capacity Charging          Medium/High      Low/Medium   Medium        Yes
Edge Pricing                        Medium/Low       Low/Medium   Low           No
Paris-Metro Charging                Medium           High         Medium/High   Yes
Market Based Reservation Charging   Medium/High      High         Medium/High   Yes


Support from the communication protocols in use on the mobile networks may be 
required to allow some of the charging models above to be implemented. This is 
needed to allow Quality of Service (QoS) provision [14] within the networks and also 
provide identity tagging or other information based on the charging model 
requirements. Reservation protocols such as Resource ReSerVation Protocol (RSVP) 
may be used in the network to provide the support for QoS, and move away from the 
best-effort service as currently used on the Internet. 



5 Conclusions and Further Work 



The various GSM networks have the option to charge their subscribers using similar 
or different models for the new 2G and 3G GSM based Internet services, as is already 
the case with 1G GSM mobile networks, due to a variety of technical, commercial, 
geographical and political issues and concerns. In the mobile network market it may 
make technical and commercial sense to adapt and combine some, or perhaps all, of 
the above charging models and additional ones into unified flexible models which 
will cover the more diversified requirements of mobile charging. Figure 2 below 
shows an example of how the various charging models may be combined for charging 
for GSM voice and Internet services. 




Fig. 2. Combining Charging Models 

A basic customer subscription may comprise a fixed price tariff that includes free 
mobile calls up to a fixed limit, plus Metered Charging for extra mobile calls. Packet 
Charging may be included for Web browsing and email using 2G or 3G GSM for the 
subscriber's account. Expected Capacity may also be included for 2G and 3G GSM 
data traffic as well as Edge Pricing and Paris Metro Charging for email and data 
traffic. Subscribers' requirements vary greatly from students to business users, from 
light domestic users to heavy personal users. By modifying and combining charging 
models, tariffs can be developed for the major demographic groups of subscribers. 

The illustration above shows the overlap between the various pricing models and how 
the boundaries can be made flexible depending on the subscriber usage profiles. For 
example, the mobile network operator may use Packet Charging in both Fixed Price 
and Metered Charging tariffs for some subscribers but only use Packet Charging with 
a Fixed Price charging model for other subscriber groups or tariffs. Some subscribers 
may only want to use limited Internet services, for example only text email and no 
Web browsing. The tariffs for the subscriber may become complicated but may 
ultimately give the subscribers more control over the way they are charged for using 
the mobile voice and Internet services and the QoS they receive from the network. 
The network operators will also have the advantage of being able to charge the 
subscribers for different levels of QoS for the different services and network provision. 
Once experience of realistic network traffic in the next generation GSM networks has 
been gained and large quantities of network usage statistics have been collected and 
analysed, then more comprehensive or simplified charging models may be 
investigated, developed and prototyped from the ones discussed above. There will 
always be a trade-off between the complexity of the billing system to be implemented 
and the advantage the network provider will receive for having the systems in place. 
Fixed price charging schemes reduce the overhead of the charging and billing systems 
infrastructure, as they tend to provide the simplest charging scenarios. Usage based 
charging models provide incremental and harder to predict income for the network 
operators as well as requiring more infrastructure investment. 

When the GPRS networks are up and running and the charging models are in place 
the economics of the market may take over and under- and/or over-provisioning of 
network resources may become apparent. Further work in this area should include the 
mathematical modelling of the various charging models on simulated mobile network 
data covering both voice and Internet data services. This may include examining the 
combining of charging models and the resultant effect on the income for the GSM 
network providers. The IETF and IRTF [15] currently have working groups on 
Authentication, Authorisation and Accounting (AAA) with goals that are relevant to 
the research reported in this paper. 
Abstract. Future Internet services will be judged to a large degree by 
the efficient deployment of real-time applications. The issue of 
utilisation is especially central for the economic introduction of live 
video in packet switching networks; at the same time, the bandwidth 
requirement is high and the QoS guarantees are stringent. With static 
allocations based on arbitrary policy enforcement and scheduling 
mechanisms coupled to them this cannot be achieved. Instead, we 
propose a fast intra-domain bandwidth management approach that can 
help in rational policy decision making. It is based on inexpensive 
video smoothing and scheduling based either on adaptive tracking of 
aggregate video queue occupancy, or adaptive prediction of aggregate 
video arrivals. In this way, a general measurement-based utilisation 
theory for live video can be derived that ensures respectively constant 
maximum allocation utilisation guarantees or less-than-maximum 
allocation utilisation target following. QoS guarantees take the form of 
a constant maximum per-hop delay that is solely dependent on 
tracker/identifier sampling interval selection. Overall, we show how to 
provide support for live video under differentiated services. 

Keywords: Live video, measurement-based, adaptive tracking, adaptive prediction, 
utilisation theory, smoothing, scheduling, intra-domain management, differentiated services. 



1. Introduction 

Recent evolutionary development of backbone packet switching networks marks an 
important paradigm shift away from end-to-end per-flow traffic control, towards 
scalable service differentiation and consideration of traffic aggregates [1]. The 
introduction of new added-value services such as IP telephony and virtual private networks 
is accelerating, extending offerings from Internet Service Providers and exploiting network 
infrastructure for the deployment of novel applications. Yet, despite these efforts, 
the biggest challenge for the Internet remains the support of real-time traffic with 
appropriate QoS and utilisation of network bandwidth [2]. Expectations for remote 
instrument control, teleimmersive scientific visualisation and Internet TV, to name 
examples of demanding applications envisioned with Differentiated Services 
(DiffServ), are technology drivers for hard guarantees of real-time data delivery. A 
static statistical approach to real-time video service provisioning would be to increase 
backbone capacity based on some effective bandwidth formula that would give statistical 
QoS guarantees [3-5]. With IntServ admission control executed at the edge, the 
approach would also promise an increased (but not quantitative) statistical multiplexing 
gain under large aggregation conditions. What actually happens with static 
allocations is that when they are high, QoS guarantees are indeed given but bandwidth 
utilisation is low; on the other hand, when they are low the reverse is true. There 
appears to be an unavoidable tradeoff between performance and efficiency. Utilisation 
guarantees are not given at all and, because of the difficulties encountered with video, 
namely high burstiness, long-range dependence and persistence, there is a tendency to 
propose peak rate allocation for real-time services. Moreover, effective bandwidth 
pre-calculation is hardly practical for real-time video due to its non-causal character 
and also because it usually involves traffic modelling assumptions that are hard to 
justify for general video sources. 

It must be stressed that the issue of simultaneous utilisation and QoS is of paramount 
importance for the future of real-time applications in the Internet, if not for the 
packet switching concept as a whole. For the ISP, it gives the ability to maximise the 
return on investment in network assets and to rationalise this investment; the ability to 
answer the question of how much bandwidth is really needed for live video aggregates 
and when. For the end-user it means early introduction of real-time video applications 
with acceptable pricing and very good quality at the same time. In our 
view, a theory that cannot give utilisation guarantees for video, but promises only 
QoS guarantees, albeit statistical, is only at an intermediate level. At the highest level, 
we believe, is a theory that achieves combined QoS and utilisation targets. A solution 
to the problem may be given by policy-based admission control for large time-scale 
iterative service design and dimensioning, while QoS-Utilisation control is enforced 
by a short time-scale, measurement-based bandwidth management and accounting 
mechanism. The iterations on service design are carried out through interaction with 
the bandwidth management and accounting module. 

Here, we attempt to put together a general utilisation theory for real-time video in 
environments of arbitrary aggregation. The framework provides constant maximum 
utilisation guarantees via smoothing and scheduling based on adaptive control of aggregate 
video queue occupancy. Less-than-maximum quantitative utilisation targets 
are tracked with smoothing and scheduling based on adaptive prediction of aggregate 
video queue arrivals. In both cases, QoS guarantees are a byproduct of the framework 
and entail constant maximum per-hop delay, dependent only on sampling interval selection. 

The rest of the paper is organised as follows. Section 2 gives an overview of the 
QoS-Utilisation control solution under differentiated services. Section 3.1 presents the 
closed-loop identification and adaptive control for guaranteed 100% allocation utilisation. 
Alternatively, in Section 3.2 the open-loop identification approach for less-than-maximum 
utilisation target tracking is described. Section 4 presents trace-driven 
simulation results. Finally, Section 5 concludes the paper. 
2. QoS and Utilisation Control for Real-Time Video 

An example of DiffServ domain concatenation is shown in Figure 1. Packets from 
microflows are classified at the edge devices, queued according to the DSCP 
codepoint and scheduled for the next hop. Intermediate network elements, called core 
routers, simply read the codepoint, then queue and forward packets depending on the 
per-hop behaviour (PHB) implemented [6]. Important traffic will be treated preferentially 
with the expedited forwarding (EF) PHB [7]. While QoS may be provisioned in this 
way for real-time video, it will invariably lead to ad hoc, too coarse, peak rate allocations. 




Fig. 1. DiffServ-aware domains 

The framework proposed here uses two main mechanisms for intra-domain management, 
namely smoothing and scheduling. Smoothing is performed either at the 
leaf/edge devices, or at every switching point with reduction of the smoothing period. 
The window of smoothing is selected so that delays contributed are not prohibitive for 
live video. An example may be given for MPEG video where typically I frames are 
larger than P frames and P frames larger than B frames. A good compromise for even 
the most demanding real-time service would be smoothing on a two-frame scale at the 
edge. For 25 frames per second, this would lead to an acceptable 80 ms delay. The 
justification for smoothing differs significantly from other related works. By realising 
the low-pass filtering effect of time-domain smoothing, one can draw on recent results 
based on a signal processing-queueing viewpoint [8]; they demonstrate that the low 
frequency part of an input process remains unaffected by queueing and therefore must 
be serviced if excessive delays and long-memory buffer processes are to be avoided. 
In [9-11] smoothing algorithms for stored or live video are addressed that try to 
reduce the variability inherent in constant quality video. They operate on a GOP or 
scene level and the associated smoothing-playback delays make them prohibitive for 
inelastic real-time traffic. In this proposal, the smoothing function serves the purpose 
of the adaptive control/prediction function. The scheduling is based on adaptive 
tracking of aggregate real-time queue occupancy, or adaptive prediction of aggregate 
real-time queue arrivals. For both cases, to achieve robust estimation, it is essential 
that the process variation scale be smaller than the controller's/predictor's sampling 
interval. Unmodeled high frequency dynamics are known to produce bursts in 
parameter estimates and subsequent instability [12]. A way to give the signal the 
proper frequency content is to filter the signal so that it has sufficient energy at low 
frequencies. Figure 2 shows the two component mechanisms when ingress smoothing 
is applied. 
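
The two-frame smoothing window mentioned above can be illustrated with a short sketch: the bits of each window of frames are spread evenly over the window, so the delay added equals the window duration (2 frames at 25 frames per second is 0.08 s). The function name and the frame sizes are hypothetical, and the even spreading is an assumption about how the smoother works.

```python
def smooth(frame_sizes, window=2, fps=25):
    """Spread the bits of each window of frames evenly over the window.

    Returns (rates, delay): the constant sending rate in bits/s for each
    window and the smoothing delay in seconds. A trailing partial window
    is still spread over a full window period.
    """
    period = window / fps
    rates = []
    for start in range(0, len(frame_sizes), window):
        chunk = frame_sizes[start:start + window]
        rates.append(sum(chunk) / period)
    return rates, period


# Hypothetical frame sizes in bits for an I, B, P, B sequence.
frames = [120_000, 20_000, 40_000, 20_000]
rates, delay = smooth(frames)

peak_unsmoothed = max(frames) * 25   # rate needed to send the I frame in one frame time
peak_smoothed = max(rates)           # rate needed after two-frame smoothing
```

With these assumed sizes the peak rate drops from 3 Mbit/s to about 1.75 Mbit/s, at the cost of the 80 ms smoothing delay.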




Fig. 2. Bandwidth management with smoothing at the first trusted ingress node 

The adaptive tracker/predictor calculates the required allocation for live video that 
the scheduler must grant. It is essentially an adaptive prediction/control-theoretic 
(ADPCT) non-work conserving service discipline, since the scheduler services only 
the real-time packets that are within the bandwidth computation. Consequently, an 
additional and equally important benefit of the service discipline is that traffic is 
engineered to be not more bursty at downstream switches/routers (if employed in all 
nodes) and may require a lower allocation. The traffic pattern distortion limiting, jitter 
control and insensitivity to cross traffic achieved with our method are a direct result 
of the algorithm acting on output aggregates. 



3. Adaptive Prediction and Control-Theoretic Service Discipline 



In the following the time is slotted in fixed computation intervals of h time units and 
the allocation u(n) is available from the tracker or the predictor D time units after the 
sampling instant. We assume D to be much smaller than h, say equal to 0.1h [13]. 
Figure 3 depicts the real-time queue available allocation slots on the [nh, (n+1)h] 
interval. The queue occupancy measured at the nth sampling instant is y(n). The 
actual queue arrivals for the [nh, (n+1)h] interval, A(n), are shown with the down 
arrows in Figure 3. 

Fig. 3. Allocation computation interval, allocation slots and queue arrivals 







328 Themistoklis Rapsomanikis 



3.1 Closed-Loop Identification for Guaranteed Maximum Allocation Utilisation 



The goal of adaptive tracking is to drive the queue occupancy y(n) towards a reference 
point which is a constant level of packets. The queue dynamics are: 

$$y(n) = y(n-1) + A(n-1) - u(n-1)\,h + D\,\bigl(u(n-1) - u(n-2)\bigr). \tag{1}$$

Due to the difficulty of obtaining an analytical expression for A(n-1), we model y(n) 
as a time-delayed ARMAX: 

$$y(n) + a_1 y(n-1) + \dots + a_k y(n-k) = b_0 u(n-\Delta) + \dots + b_m u(n-\Delta-m) + e(n) + c_1 e(n-1) + \dots + c_k e(n-k), \tag{2}$$

where $e(n)$ are i.i.d. zero-mean Gaussian variables, $\Delta$ the delay and $a_i$, $b_i$, $c_i$ unknown 
and time-varying parameters. We write the above in terms of the forward shift operator $q$: 

$$A(q)\,y(n-k) = B(q)\,u(n-\Delta-m) + C(q)\,e(n-k), \tag{3}$$

where 

$$A(q) = q^k + \sum_{i=0}^{k-1} a_{k-i}\,q^i, \quad B(q) = \sum_{i=0}^{m} b_{m-i}\,q^i \quad \text{and} \quad C(q) = q^k + \sum_{i=0}^{k-1} c_{k-i}\,q^i.$$

In the case of a minimum variance adaptive controller, the time delay is equal to the 
pole excess of the system, $\Delta = k - m$. Then (3) is equivalent to: 

$$A(q)\,y(n) = B(q)\,u(n) + C(q)\,e(n), \tag{4}$$

with $A(q)$, $B(q)$ and $C(q)$ as defined above. 

For closed-loop identification of system parameters we choose RLS estimation which 
shows good convergence and bias [14]. The self-tuner is an indirect one since the 
coefficients of the polynomials A, B and C are first estimated by recursive least 
squares and then used in the control design, as if they were the true ones. Let, 

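$$\theta(n) = \bigl[a_1(n) \;\dots\; a_k(n),\; b_0(n) \;\dots\; b_m(n),\; c_1(n) \;\dots\; c_k(n)\bigr]^T,$$

$$\varphi(n) = \bigl[-y(n-1) \;\dots\; -y(n-k),\; u(n-\Delta) \;\dots\; u(n-k),\; e(n-1) \;\dots\; e(n-k)\bigr]^T.$$

Then the true state at step n is: 

$$y(n) = \varphi^T(n)\,\theta(n) + e(n). \tag{6}$$

The estimation vector is defined as: 

$$\hat{\theta}(n) = \bigl[\hat{a}_1(n) \;\dots\; \hat{a}_k(n),\; \hat{b}_0(n) \;\dots\; \hat{b}_m(n),\; \hat{c}_1(n) \;\dots\; \hat{c}_k(n)\bigr]^T.$$

The estimated state at step n is: 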
The RLS computation of the estimation vector with variable forgetting is given by: 
the variable forgetting factor [15]. The control design is based on predicting the 
output y(n) $\Delta$ steps ahead (for the rest of this paper $\Delta = 1$) and then computing the input 
u(n) such that the predicted output is equal to the setpoint, a constant $B > 0$ in our case. 
For indirect minimum variance self-tuning this is accomplished by solving the 
following Diophantine equation: 

$$q^{\Delta-1}\,C(n,q) = A(n,q)\,F(n,q) + G(n,q),$$

where 

$$F(n,q) = q^{\Delta-1} + f_1(n)\,q^{\Delta-2} + \dots + f_{\Delta-1}(n) \quad \text{and} \quad G(n,q) = g_0(n)\,q^{k-1} + g_1(n)\,q^{k-2} + \dots + g_{k-1}(n).$$

Then the control input is given by: 

$$B(n,q)\,F(n,q)\,u(n) = q^{-1}\,C(n,q)\,B - G(n,q)\,y(n).$$
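
For the one-step-ahead case ($\Delta = 1$) used in the rest of the paper, the design step collapses to a closed form. The worked derivation below is a sketch that assumes the standard indirect minimum-variance self-tuner, i.e. the Diophantine equation $q^{\Delta-1}C = AF + G$ and the control law $BFu(n) = q^{-1}CB - Gy(n)$, with monic $A$ and $C$ of degree $k$ and the polynomial $B(n,q)$ of degree $m = k-1$. The time index on the coefficients is dropped for brevity, and the bare symbol $B$ on the right-hand side is the constant setpoint.

```latex
\begin{align*}
\Delta = 1:\quad
 & F(q) = 1, \qquad
   G(q) = C(q) - A(q) = \sum_{j=0}^{k-1} \bigl(c_{j+1} - a_{j+1}\bigr)\, q^{\,k-1-j}, \\[4pt]
 & b_0\, u(n) = \Bigl(1 + \sum_{i=1}^{k} c_i\Bigr) B
   \;-\; \sum_{j=0}^{k-1} \bigl(c_{j+1} - a_{j+1}\bigr)\, y(n-j)
   \;-\; \sum_{i=1}^{k-1} b_i\, u(n-i).
\end{align*}
```

The second line follows by multiplying the control law by $q^{-(k-1)}$ so that only present and past samples appear; since the setpoint is constant, shifting it has no effect. For $k = 1$ this reduces to $b_0 u(n) = (1 + c_1)B - (c_1 - a_1)\,y(n)$.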

We have completed the adaptive controller design. Now, the available allocation slots 
in the $[T_n, T_n + D)$ and $[T_n + D, T_{n+1})$ intervals are respectively: 

$$N_D(n) = u(n-1)\,D, \qquad N_{h-D}(n) = u(n)\,(h - D). \tag{12}$$
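
The bookkeeping behind the allocation slots and the queue dynamics can be checked numerically. The sketch below assumes that the previous allocation u(n-1) stays in force for the first D time units of an interval and the new allocation u(n) for the remaining h - D, so the slots are u(n-1)D and u(n)(h - D); the function names and the numbers are hypothetical.

```python
def slots(u_prev, u_now, h, D):
    """Allocation slots in [nh, nh+D) and [nh+D, (n+1)h)."""
    return u_prev * D, u_now * (h - D)


def next_occupancy(y, arrivals, u_prev, u_now, h, D):
    """Queue occupancy at the next sampling instant: backlog plus arrivals
    minus the slots made available during the interval."""
    n_d, n_rest = slots(u_prev, u_now, h, D)
    return y + arrivals - (n_d + n_rest)


# Hypothetical numbers: h = 40 ms, D = 0.1h, allocations in packets per second.
h, D = 0.04, 0.004
u_prev, u_now = 1000.0, 1500.0
n_d, n_rest = slots(u_prev, u_now, h, D)
y_next = next_occupancy(10.0, 50.0, u_prev, u_now, h, D)

# The same update written in the form of the queue dynamics equation.
y_next_alt = 10.0 + 50.0 - u_now * h + D * (u_now - u_prev)
```

Both forms give the same occupancy, which confirms that the correction term D(u(n) - u(n-1)) simply accounts for the old allocation being used during the first D time units.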

The following theorem gives the video allocation utilisation achieved with scheduling 
based on adaptive tracking of smoothed video aggregate queue occupancy. We 
introduce the conditions: 

A1. The smoothing interval is an integer multiple of the controller's interval and 
the two mechanisms are synchronised. 

A2. The adaptive tracker converges to the constant reference point $B > 0$, with a 
negative prediction error small enough that y(n) > 0 always. 

Theorem 1. If Conditions A1-A2 are satisfied, then the utilisation for every packet 
leaving the queue is constant and equal to 100%. Furthermore, the maximum delay 
that a live video packet may experience is equal to $T_r + T_s < h$, where $T_r$ is the 
remaining time from the arrival of a packet at the queue to the end of the sampling 
interval and $T_s$ the time to schedule the packet at the current interval. 

Proof. When tracking is accurate, the queue occupancy y(n) fluctuates around B so 
\h?iiy(n)>0 7 n. Therefore the allocation u(n) is such that: 
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y\n#( aWs (13) 

During the [nh, (n+l)h) interval, arrivals are uniform because of smoothing s dur- 
ation and smoothing-scheduling synchronisation. Actually they are uniform over an 
Sj)=jh interval where j an integer. It follows that the intersample queue oceupancy 

y(t)>0, t9 [nh, (n+l)h). So, the time between scheduling for transmission the current 
and the previous packet is: 

1/m|« 3 1=, t9 [nh,nh ( D) (14) 

r/V#) f 

• 1/wVzt, t9 [nh ( D, (n( l)h). 

Remember that the scheduler does not stay idle with available live video packets, as 
long as they are within the computed allocation. Since the utilisation for every paeket 
is given by: 

l/wWlf't/V#, t9 [nh,nh ( D) (15) 

dV#) f 

• l/wVif't/V#, t9 [nh( D, (n( l)h) . 

it follows from (14) and (15) that all live video packets leave back-to-back (with respect
to each other) and the utilisation is constant at 100%. As for the maximum delay, since
queueing is FIFO and we assume accurate tracking, the cells left at the queue are
expected to be the ones that arrived shortly before the end of the current interval,
T_r time units before the next sampling instant. They will be the first ones to be served
during the next interval, T_s time units after the next sampling instant. Consequently
the maximum packet delay is T_r+T_s. This completes the proof of Theorem 1.

A natural question is what will happen when A(n)<1? The answer is that if the
control system tries to adapt under such poor excitation conditions, packets will be
delayed more than T_r+T_s. The safeguard is to monitor the arrivals and, when A(n)=0,
to switch off adaptation and transmit awaiting packets at wire speed. The maximum
delay for these final packets will be T_r+T_s+N/r, where N is the queue occupancy
after poor excitation has been identified and r the line rate.

From the discussion above it is evident that Condition A2 is the strongest one.
It is well known that a rigorous convergence analysis of stochastic adaptive controllers
for general stochastic processes is very complicated and only recently have convergence
problems of several basic least-squares self-tuning regulators been solved
completely [16]. Moreover, as we have pointed out earlier, it is very difficult to come
up with traffic models that are general enough to represent all video contents in a
unified way. And even if such a complex model existed, its on-line fitting would
prove problematic. Having said this, it must be noted that adaptive control has found
much success in applications, and the simulations presented here indicate that the
self-tuning tracker does converge for smoothed live video aggregates, with RLS
identification enhanced with a variable forgetting factor.

3.2 Open-Loop Identification for Less-Than-Maximum Utilisation Target Following

The goal of adaptive prediction is to estimate live video queue arrivals and, taking the
estimate as the true value, compute an allocation that ensures a specific utilisation
level at each sampling interval. Therefore, the target here is not the queue occupancy
but the utilisation U itself, U<1. How close to the utilisation target we get depends on
the prediction accuracy. At the end of each sampling interval the queue occupancy
will be y(n)=0 if estimation is accurate, because of the (over)allocation. The intuition
is that, if the arrivals were completely known, then by adaptive tracking of an equivalent
negative queue occupancy we could enforce a specific number of allocation
slots (equal to the equivalent negative queue occupancy) that would be lost for the
specific number of arrivals. Since the arrivals are not known, the only way to do this
is by predicting the arrivals and overallocating, the prediction taken as the true value.
It is as if we were tracking a variable equivalent negative queue occupancy.

The model here is a time-delayed ARMA: 

cXjtkyi#) D\q#eVi#, (16) 

c\^#) andZ)\^#) q'' ( + . 

i) 0 i) 0 

where e(n) are i.i.d. zero-mean Gaussian variables as before. Again we choose RLS 
identification with a variable forgetting factor. Here, 

i yVi 3 1# 3 y^i 3 kit, e4i 3 1# 3 kWc 

For the interval [nh, nh+D) no live video allocation is performed. This is to prevent
correlation with the previous interval's allocation, since allocation is now based on
arrival estimation. This means that scheduling is actually performed in the [nh+D,
(n+1)h) interval, but the arrivals are counted for the whole [nh, (n+1)h). Besides
counting, a measurement of y(n), the queue occupancy, is taken at each sampling
instant. The available allocation slots in [nh+D, (n+1)h) that take into account the
backlog y(n) from the previous interval are:

pV?#) C-y'^^i where 
C ) U^^,U the utilisation target . 

The following theorem presents utilisation guarantees achieved with scheduling based
on adaptive prediction of smoothed video aggregate arrivals. The conditions now are:

B1: The smoothing interval is an integer multiple of the predictor's interval and
the two mechanisms are synchronised.

B2: The prediction errors are white noise, or at most short memory.

Theorem 2. If Conditions B1, B2 are satisfied, the utilisation is constant in the
allocation interval, but exhibits an inter-allocation variation around the target U.
Furthermore, the maximum delay that may be experienced is T_r+T_s<h, where T_r is
the remaining time, from the arrival of a packet at the queue to the end of the
sampling interval, and T_s the time to schedule the packet at the current interval.

Proof. When arrivals are uniform during the [nh, (n+1)h) interval the utilisation
for every packet is:

It is obvious that if the prediction is always accurate, the utilisation is exactly U in all
allocation intervals. But positive and negative prediction errors will occur, leading to
a slight fluctuation around the target U. In the case of positive prediction errors, the
utilisation U(n) will be higher than U, while in the case of negative prediction errors it
will be lower than U. For positive prediction errors a backlog y(n+1)>0 will remain at
the queue only if the numerator in (19) is higher than the denominator. For negative
prediction errors or accurate prediction, y(n+1)=0. The allocation slots are just as in
(18), to compensate for the occasion when there is a backlog at the queue. This also
justifies why the maximum delay is T_r+T_s. When there is a positive prediction error
that leads to a backlog, the maximum delay is T_r+T_s. In all other cases, the maximum
delay is T_r. This completes the proof of Theorem 2.
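
The overallocation idea can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the allocation rule used here (predicted arrivals divided by the target U, plus the measured backlog) and the names allocation_slots and realised_utilisation are hypothetical and do not reproduce expressions (18) and (19).

```python
import math


def allocation_slots(predicted_arrivals, backlog, target_u):
    """Slots granted for the next allocation interval (assumed rule, see lead-in)."""
    if not 0.0 < target_u <= 1.0:
        raise ValueError("target utilisation must be in (0, 1]")
    # Overallocate for the predicted arrivals, then add room for the backlog.
    return math.ceil(predicted_arrivals / target_u) + backlog


def realised_utilisation(actual_arrivals, backlog, slots):
    """Return (fraction of granted slots actually used, backlog left over)."""
    demand = actual_arrivals + backlog
    served = min(demand, slots)
    return served / slots, demand - served


if __name__ == "__main__":
    slots = allocation_slots(predicted_arrivals=40, backlog=0, target_u=0.5)
    # More arrivals than predicted: utilisation rises above the target.
    print(realised_utilisation(50, 0, slots))
    # Far more arrivals than slots: utilisation saturates and a backlog remains.
    print(realised_utilisation(100, 0, slots))
```

In this sketch an underestimate of the arrivals raises the realised utilisation above the target, and a backlog is carried over only once the demand exceeds the granted slots, which mirrors the qualitative behaviour described in the proof.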

Again, as in Theorem 1, Condition B2 is the most important. Our simulation
studies show that the identifier is capable of following smoothed video aggregates and
that the prediction error is white noise. The same as in Theorem 1 applies when A(n)=0:
the prediction is switched off and, if there is any backlog, it is scheduled back-to-back
at line rate.
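
Both schemes rely on recursive least squares (RLS) identification with a variable forgetting factor. The following Python sketch shows what such an identifier might look like. The autoregressive regressor, the model order, the initial covariance p0 and the function names are illustrative assumptions, not the controller or predictor of this paper; the forgetting-factor recursion uses the constants quoted for the experiments in Section 4.

```python
def forgetting_factors(steps, lam0=0.65, alpha=0.6):
    """Return lambda(0..steps) for lambda(n) = alpha*lambda(n-1) + (1 - alpha)."""
    lams = [lam0]
    for _ in range(steps):
        lams.append(alpha * lams[-1] + (1.0 - alpha))
    return lams


def rls_identify(samples, order=2, lam0=0.65, alpha=0.6, p0=1000.0):
    """Fit y(n) ~ theta . [y(n-1), ..., y(n-order)] by RLS with variable forgetting."""
    theta = [0.0] * order
    # P starts as a large multiple of the identity (little prior confidence).
    P = [[p0 if i == j else 0.0 for j in range(order)] for i in range(order)]
    lam = lam0
    for n in range(order, len(samples)):
        phi = [samples[n - 1 - i] for i in range(order)]          # regressor
        P_phi = [sum(P[i][j] * phi[j] for j in range(order)) for i in range(order)]
        denom = lam + sum(phi[i] * P_phi[i] for i in range(order))
        gain = [v / denom for v in P_phi]
        err = samples[n] - sum(theta[i] * phi[i] for i in range(order))
        theta = [theta[i] + gain[i] * err for i in range(order)]
        # Covariance update; P stays symmetric, so P_phi also serves as phi^T P.
        P = [[(P[i][j] - gain[i] * P_phi[j]) / lam for j in range(order)]
             for i in range(order)]
        lam = alpha * lam + (1.0 - alpha)                         # lambda -> 1
    return theta


if __name__ == "__main__":
    y = [1.0, 2.0]
    for _ in range(40):
        y.append(0.5 * y[-1] + 0.3 * y[-2])   # noise-free test signal
    print(rls_identify(y))
```

The forgetting factor starts low, so early and possibly misleading samples are discounted quickly, and then rises geometrically towards 1, so that the estimate settles once the transient has passed.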



4 Experimental Results

In this section, the performance of the adaptive tracking/prediction solution is evaluated
with trace-driven simulation. From the Star Wars MPEG-1 trace [17] coded with
N=12, M=3, 25 frames per second and 384×288 pels resolution, we create 10 independent
video sources with the rule given in [18]. To study a worst case scenario, I-frame
alignment was chosen. The equivalent aggregate trace (in cells) is shown in
Figure 4. For the adaptive tracker operating at a 1 ms sampling interval we consider a
system model with k=2 and [...]. The setpoint is B=2 cells. The variable forgetting
factor is λ(n)=0.6λ(n-1)+(1-0.6) and λ(0)=0.65. Figure 5 shows the resulting live
video queue occupancy. It is evident that the controller tracks the target very accurately,
with only one cell offset and minimal startup transient. The allocation granted
to the live video queue represents the control effort needed for tracking. It is
essentially the price paid for minimising the variance of the system output (here the
buffer occupancy). Bandwidth savings of 28%-100% compared to peak rate allocation are
provided even for such a small number of sources.

Fig. 5. Aggregate video queue occupancy for scheduling based on adaptive tracking

Next, results with the same aggregate traffic but for scheduling based on adaptive
prediction of video arrivals are presented. Again the sampling interval is 1 ms and for
the system model we choose k=2, with the same variable forgetting factor. Figure 6
shows the allocation utilisation U(n). Here, the utilisation is not constant but
fluctuates around the target U=0.9 with good transient and steady-state response. In
this case too, despite overallocation, bandwidth reductions are high. For the arrival
estimation error, a zero mean white noise of only one cell variance is demonstrated.

Fig. 6. Allocation utilisation for scheduling based on adaptive prediction and a target U=0.9



5 Conclusions

The differentiated services architecture has received much attention, mainly because 
of its promise for scalable QoS deployment by grouping microflows into behaviour 
aggregates at domain edges and keeping cores stateless. While this eases the task of 
introducing QoS into the backbone IP clouds, compared to an end-to-end signalling 
paradigm like Integrated Services or switched ATM, the issue of simultaneously 
providing QoS and Utilisation guarantees remains a fundamental one for future 
Internet applications based on packet switching, especially the ones that have high
bandwidth demands, require hard assurances of low latency and are to a great extent
unknown, such as live video. Interestingly, the current focus of DiffServ research and
development is on the inter-domain management of SLSs through policy-based 
admission control and automatic Bandwidth Broker communication. However, for 
real-time applications more important may be the deployment of intra-domain per- 
hop management mechanisms that are fast and accurate, decoupled from longer time- 
scale policies and provide an efficient bandwidth accounting and administration 
platform on which informed policy decisions can be made and not the other way 
around, as the current status is with RED, WFQ, CBQ, or PQ mechanisms. In this 
paper, such a framework has been presented that allows the introduction of a unified 
measurement-based Utilisation theory for live video. It uses smoothing and 
scheduling based on adaptive control of aggregate video queue occupancy or adaptive 
prediction of aggregate video arrivals. QoS guarantees are inherently given in this 
framework through tracker/identifier sampling interval selection alone. Under 
adaptive tracking constant 100% allocation utilisation guarantees are provided, while 
less-than-maximum utilisation targets are reached via adaptive prediction, irrespective 
of the number of flows, buffer size or live video traffic non-stationarities.



Abstract. The implementation and evaluation of new Internet communication
systems face some general problems. The implementation of Quality
of Service concepts in kernel space is complex and time consuming,
while the final setup of the evaluation networks lacks the desired size
and flexibility. The usual alternative, simulation, cannot cope with the
real world. This paper presents a concept to combine real components
like hosts and routers with simulated nodes. This simplifies the setup
of huge test scenarios and the implementation of new concepts, while
keeping the evaluation results realistic. The concept and the implementation
of virtual routers are presented by means of their application in a
Differentiated Services network.



1 Introduction 

A general problem in research on networking is the demand for setting up test
beds of sufficient size and complexity to show the desired results or to prove a
new concept. Alternatively, network simulators like ns [ns] or OpNet [opn] can be
used to prototype a device or a protocol in the special simulation environment
and to run the desired tests. So the simulation normally precedes the setup of
the test scenario in a laboratory. Nevertheless there are a few drawbacks to
this approach:

- The simulations can often cover only a single aspect of the problem,
  ignoring side effects.

- Especially for application-oriented research, simulators lack complexity:
  realistic traffic sources and sinks are often missing, as are fully functional
  protocol implementations.

- Ad hoc implementations on real platforms are often extremely time consuming
  (Linux kernel hacking).

- The evaluation of new components depends on realistic traffic. Simulators
  normally allow one to define an abstract traffic type or to simulate load based
  on a real network’s traffic logs, but lack interactive real-time evaluation.

- Most simulators are not able to combine real network components with simulated
  environments, a functionality which can be very useful during the
  development or debugging of programs (see also [BEF+00]).

Because of this, we propose an approach that combines reality with simulation,
using real devices wherever needed and emulating the rest.

2 Softlink Device and Virtual Router 

The basic idea of combining real hardware with an emulated topology is shown
in Figure 1. The core mechanism is that of a Virtual Router (VR) emulating a
real IP packet forwarder. Each VR has a number of interfaces attached, which
can be connected to interfaces of other VRs (the dashed lines) or via a softlink
device to the local host. Each host may run multiple VRs. Note that the host
includes a complete TCP/IP stack and the applications running on it.

Fig. 1. The components of a VR



The network layer of the host system should not detect any differences between
the real network and the emulated topology. So it is possible to define
an emulated topology consisting of multiple VRs distributed over several machines.
The communication between the real world and the emulated topology
is achieved by the softlink devices. Such a softlink device acts as an interface
between the operating system's network layer and a virtual router. To the
operating system (OS) kernel and user space it looks like a normal Ethernet device,
transporting data to user space and vice versa. The only additional functionality
is the stripping of the Ethernet headers, so raw IP is transferred to user space.
In later versions this stripping should be omitted, requiring the processing of
raw Ethernet frames in user space. Then the type of the encapsulated datagram
could be analyzed to distinguish different protocols. The softlink device has been
implemented as a Linux kernel module for kernels > 2.2.12.

Virtual Routers (VRs) are used to realize the network topology to be emulated.
Figure 2 shows the principal software architecture. In line with the VR's primary
task, the architecture focuses on IP routing. The VR is implemented completely
in plain C++, making the source extensible and easy to port.

The central forwarding mechanism (fwd circle) acts on standard routing rules,
but was extended to allow routing decisions by source addresses, port numbers,
protocol fields and TOS values.
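
A rule lookup extended in this way might be sketched as follows in Python. The Packet and Rule structures, the tie-breaking order (more selector fields first, then longest destination prefix) and the interface name if2 are hypothetical choices made for the illustration; the paper does not specify how the VR orders competing rules.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Optional


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str
    protocol: int             # IP protocol number: 1 ICMP, 6 TCP, 17 UDP
    dst_port: Optional[int]   # None for protocols without ports
    tos: int


@dataclass(frozen=True)
class Rule:
    dst_net: str                       # destination prefix, e.g. "10.1.2.0/24"
    out_if: str                        # outgoing interface name
    src_net: Optional[str] = None      # optional selectors; None means "any"
    protocol: Optional[int] = None
    dst_port: Optional[int] = None
    tos: Optional[int] = None

    def matches(self, pkt: Packet) -> bool:
        if ip_address(pkt.dst) not in ip_network(self.dst_net):
            return False
        if self.src_net is not None and ip_address(pkt.src) not in ip_network(self.src_net):
            return False
        if self.protocol is not None and pkt.protocol != self.protocol:
            return False
        if self.dst_port is not None and pkt.dst_port != self.dst_port:
            return False
        if self.tos is not None and pkt.tos != self.tos:
            return False
        return True

    def specificity(self):
        # Assumed ordering: count of extra selectors, then prefix length.
        extra = sum(f is not None
                    for f in (self.src_net, self.protocol, self.dst_port, self.tos))
        return (extra, ip_network(self.dst_net).prefixlen)


def lookup(rules, pkt):
    """Return the outgoing interface of the most specific matching rule."""
    candidates = [r for r in rules if r.matches(pkt)]
    if not candidates:
        return None
    return max(candidates, key=Rule.specificity).out_if


if __name__ == "__main__":
    rules = [
        Rule("0.0.0.0/0", "if0"),
        Rule("10.1.2.0/24", "if1"),
        Rule("10.1.2.0/24", "if2", protocol=17),   # UDP takes another interface
    ]
    print(lookup(rules, Packet("10.1.1.1", "10.1.2.1", 6, 80, 0)))
    print(lookup(rules, Packet("10.1.1.1", "10.1.2.1", 17, 5000, 0)))
```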

As an interface to programs running on this virtual host, the VR would
use the usual IP stacks with TCP, UDP and ICMP. Currently only a simple ICMP
stack for debugging purposes has been implemented, allowing a ping to the
virtual host.

The main work regarding IP processing is assigned to the interface com- 
ponents underlying the routing mechanism. Figure 2 shows two of them. Each 
interface can be connected to a softlink device, acting as a transition point to 
the real network or to another VR-interface. For the connection to other VR 
interfaces we use UDP. 

Received data is first processed by an IP address translation unit (NAT).
After that step packets are delivered to the host filter. This unit is programmable
and allows the processing of specific streams at higher layers. This simplifies
the implementation of certain daemons, but it is also a convenient facility for
processing streams during transmission (e.g. video or audio data). By default the host
filter only checks for IP packets addressed to the virtual host and for the Router
Alert Option [Kat97].

When the VR acts as sender, data also passes through the host filter and NAT and
is put into the queueing system before being transmitted by the softlink device or
sent via UDP. A token bucket filter preceding the connector is used to limit the
maximum bandwidth of the interface.
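
As a reminder of how a token bucket limits bandwidth, here is a minimal Python sketch. The class name, the byte-based accounting and the explicit clock argument are assumptions made for the illustration and do not describe the VR's C++ implementation.

```python
class TokenBucket:
    """Token bucket rate limiter: tokens are bytes, refilled at a constant rate."""

    def __init__(self, rate_bps, burst_bytes):
        self.rate = rate_bps / 8.0      # refill rate in bytes per second
        self.capacity = burst_bytes     # maximum burst size in bytes
        self.tokens = float(burst_bytes)
        self.last = 0.0                 # time of the last update, in seconds

    def allow(self, size_bytes, now):
        """Return True and consume tokens if the packet conforms at time `now`."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if size_bytes <= self.tokens:
            self.tokens -= size_bytes
            return True
        return False


if __name__ == "__main__":
    tb = TokenBucket(rate_bps=2_000_000, burst_bytes=3000)
    for t in (0.0, 0.0, 0.0, 0.01):
        print(t, tb.allow(1500, now=t))
```

A non-conforming packet can either be dropped or held back until enough tokens have accumulated; the sketch leaves that choice to the caller.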

Because of its flexibility the queueing system is the most complex part of the
interface. It consists of a pool of components such as queues, filters, shapers and
schedulers. The current implementation offers the following components: a generic
classifier, a Token Bucket Filter, a drop tail queue, a Random Early Detection
(RED) queue [FJ93], a Weighted Fair Queueing (WFQ) scheduler, a simple
Round Robin (RR) scheduler and a Priority Round Robin (PRR) scheduler.
Additional components are a RED queue with three drop precedences (TRIO),
a special marker for differentiated services and a Priority Weighted Fair Queueing
(PWFQ) scheduler for the implementation of Expedited [JNP98] and Assured
Forwarding [HBWW98]. The queueing system can be configured completely
at runtime via the API or the command line interface (CLI). The object-oriented
implementation of the queueing system and its components makes it
easy to add or modify individual functions.
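
Of the components listed, RED is the one whose behaviour is least obvious from its name. The following Python sketch shows the drop decision in the style of [FJ93]: an exponentially weighted average of the queue length is compared with two thresholds, and between them the drop probability grows linearly and is corrected by the number of packets accepted since the last drop. The function names and the default parameter values are illustrative assumptions, not the VR's settings.

```python
def ewma(avg, queue_len, weight=0.002):
    """Update the average queue length with an exponentially weighted moving average."""
    return (1.0 - weight) * avg + weight * queue_len


def red_drop_probability(avg, min_th, max_th, max_p=0.1, count=0):
    """Drop probability for an arriving packet, given the average queue length.

    `count` is the number of packets accepted since the last drop.
    """
    if avg < min_th:
        return 0.0                      # below the lower threshold: never drop
    if avg >= max_th:
        return 1.0                      # above the upper threshold: always drop
    p_b = max_p * (avg - min_th) / (max_th - min_th)
    if count * p_b >= 1.0:
        return 1.0
    return p_b / (1.0 - count * p_b)    # spreads drops more evenly over time


if __name__ == "__main__":
    avg = 0.0
    for q in (0, 20, 40, 60, 80):
        avg = ewma(avg, q, weight=0.2)
        print(q, round(avg, 2), round(red_drop_probability(avg, 5, 15), 3))
```

A queue with several drop precedences, such as the TRIO component mentioned above, could apply the same computation with a different threshold set per precedence; that mapping is not specified here and would be an additional assumption.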

For the configuration of the VR a command line interface and an API have
been implemented. The API allows programs running on the virtual host to
change interface settings, routing rules and so on. The command line interface is
accessible via console or external¹ telnet.



3 Network Setup 

3.1 A Minimal Setup 

In theory any network topology can be realized on only one host. The number
of VRs is limited only by the system resources. However, before things get
complicated we will first demonstrate the simplest setup, using only one VR
connected to a host.

The host gets an additional interface sol0 with the IP address 10.1.1.1 using
the softlink device.

sol0  Link encap:Ethernet
      HWaddr 00:53:4F:46:54:4C  inet addr:10.1.1.1
      Bcast:10.1.1.255  Mask:255.255.255.0
      MTU:1500  Metric:1
      UP BROADCAST RUNNING NOARP MULTICAST
      [...]

Here we choose non-routed addresses of the type 10.x.x.x. As a next step
we set up the VR. We configure an interface if0 with the address 10.1.1.2 and
connect it to sol0, set up a minimal queueing system consisting only
of a single drop tail queue and set the corresponding routing rules.

#
# Interface SETUP
#
interface if0 10.1.1.2 255.255.255.0
interface if0 connect /dev/sol0
interface if0 sqc create droptail
interface if0 sqc chain 0 to 1
interface if0 sqc chain 1 to 0
#
# ROUTING TABLE
#
route add 10.1.1.0 255.255.255.0 if0
route add 0.0.0.0 0.0.0.0 if0

¹ This telnet connection is not provided by the VR but by the host, making it
independent of any changes made to the VR. This simplifies the setup of multiple machines
crucially, because no changes to the interface setup or queueing mechanisms can harm
the TCP connection used for the configuration.

Now we can test the scenario by pinging the ICMP stack of the VR. A
ping 10.1.1.2 results in:

PING 10.1.1.2 (10.1.1.2): 56 data bytes 
64 bytes from 10.1.1.2: icmp_seq=0 ttl=187 0.5 ms 
64 bytes from 10.1.1.2: icmp_seq=1 ttl=187 0.2 ms
64 bytes from 10.1.1.2: icmp_seq=2 ttl=187 0.2 ms 

Following this example it is easy to set up bigger topologies. To connect two
VRs via UDP, only the command interface if0 connect /dev/sol0 has to
be changed to interface if0 connect golem:8000 8001. The VR will then
send outgoing packets via UDP to host golem on port 8000 and expect incoming
packets on port 8001.
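
The UDP connection between two VR interfaces can be pictured with the following Python sketch, in which each raw IP packet travels as the payload of one UDP datagram. The class UdpConnector, its methods and the use of the loopback address are hypothetical; the sketch only mirrors the "remote host:port plus local port" form of the connect command and says nothing about the VR's actual wire format.

```python
import socket


class UdpConnector:
    """Carries raw IP packets to a peer connector, one packet per UDP datagram."""

    def __init__(self, local_port, remote=None, bind_addr="127.0.0.1"):
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        self.sock.bind((bind_addr, local_port))   # port 0 picks a free port
        self.sock.settimeout(2.0)                 # avoid blocking forever
        self.remote = remote                      # (host, port) of the peer

    @property
    def local_port(self):
        return self.sock.getsockname()[1]

    def send(self, ip_packet: bytes):
        """Encapsulate: the whole IP packet becomes the UDP payload."""
        self.sock.sendto(ip_packet, self.remote)

    def receive(self) -> bytes:
        """Decapsulate: return the IP packet carried by the next datagram."""
        data, _ = self.sock.recvfrom(65535)
        return data

    def close(self):
        self.sock.close()


if __name__ == "__main__":
    a, b = UdpConnector(0), UdpConnector(0)
    a.remote = ("127.0.0.1", b.local_port)
    b.remote = ("127.0.0.1", a.local_port)
    a.send(bytes([0x45, 0x00, 0x00, 0x14]) + bytes(16))   # dummy 20-byte IP header
    print(len(b.receive()))
    a.close()
    b.close()
```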

3.2 A TCP Capable (Tunnelling) Setup 

The minimal example has shown that a VR can be set up like a normal router
connected to our host. But even though we can ping the VR, it is not possible to
open a TCP connection between our host and the VR, because of the missing
TCP and UDP stacks. So we need two real hosts as source and sink and two
VRs to transport the traffic.

Figure 3 shows the principal setup of the two hosts. Source and sink are real
machines, so we do not need to rely on a (more or less complete) simulated
TCP implementation in the VRs, but can use the widespread, comparable TCP
implementations of the host systems. When we open a TCP connection from
Host A to 10.1.3.1, the traffic is routed over sol0 to VR A, is then encapsulated
in UDP packets and transported over a kind of UDP tunnel (the dashed line
in Figure 3) to Host B (more exactly, to 130.92.70.7) and finally to VR B,
which decapsulates the traffic and forwards it via sol0 on Host B.

Fig. 3. A TCP capable Setup with two hosts and two VRs

It is easy to see that each host can serve as more than one source or sink. With
each softlink device added to the host and the VR you gain a source or
a sink. Of course the number of VRs is limited by the number of tunnels the
physical connection between the hosts can manage. But as long as the configured
bandwidth of the VRs’ interfaces is small compared to the available bandwidth
of the host, no problems are to be expected.



#
# Interface SETUP
#
interface if0 10.1.1.1 255.255.255.0
interface if0 connect /dev/sol0
interface if0 sqc create droptail
interface if0 sqc chain 0 to 1
interface if0 sqc chain 1 to 0
#
interface if1 10.1.2.1 255.255.255.0
interface if1 connect 130.92.70.7:8042 8041
interface if1 sqc create droptail
interface if1 sqc chain 0 to 1
interface if1 sqc chain 1 to 0
#
# ROUTING TABLE
#
route add 10.1.1.0 255.255.255.0 if0
route add 10.1.2.0 255.255.255.255 if1
route add 10.1.3.0 255.255.255.255 if1
route add 0.0.0.0 0.0.0.0 if0



The script shows the setup of VR A. VR B has to be configured in an
analogous way. VR A has its interface if0 connected to the softlink device on
Host A and the interface if1 connected via UDP tunnel to the corresponding
interface on VR B.



3.3 Using Address Translation for Network Setup 

In the long term it is unsatisfactory that at least two machines have to be used
for higher-layer protocols. Even if, with Gigabit Ethernet, the physically available
bandwidth is no scarce resource, this problem limits the usability of VRs. At the
moment the VR has no TCP or UDP stack, requiring real hosts as end systems,
so a minimum setup would need at least two systems. This was
the reason why the already mentioned Network Address Translation unit was
added to the interface structure (see Figure 2). NAT removes the requirement
for two real hosts. In the following we give a short example of how
to use NAT to route IP packets through a VR.

Figure 4 shows the IP address translations occurring during forwarding 
through the router. Each address pair represents the source and destination IP 
addresses of the ICMP ping packet at each state in the VR. The interface comes 
with two NAT filters, one for incoming packets and the destination address and 
one for outgoing packets and the source address. 

It should be mentioned that the source host and the sink host are the
same machine, but connected to two different softlink interfaces (10.1.1.1 and
10.1.2.1). The source connects to a dummy address 10.1.1.42, which is routed over
the VR. During transmission the VR converts the packet’s destination address to
10.1.2.1 and the source address to 10.1.2.42. Any response sent to 10.1.2.42 will
be converted in an analogous way.

Fig. 4. A Setup using Network Address Translation features



#
# Interface SETUP
#
interface if0 10.1.1.2 255.255.255.0
interface if0 connect /dev/sol0
interface if1 10.1.2.2 255.255.255.0
interface if1 connect /dev/sol1
#
# MAP IP ADDRESSES
#
interface if0 rmq add 10.1.1.42 255.255.255.255 10.1.2.1
interface if1 smq add 10.1.1.1 255.255.255.255 10.1.2.42
#
interface if1 rmq add 10.1.2.42 255.255.255.255 10.1.1.1
interface if0 smq add 10.1.2.1 255.255.255.255 10.1.1.42
#
[... Setup of Queueing Systems ...]
#
# ROUTING TABLE
#
route add 10.1.1.0 255.255.255.0 if0
route add 10.1.2.0 255.255.255.0 if1
route add 0.0.0.0 0.0.0.0 if0
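
The effect of the four mapping rules can be traced with a small Python model. It assumes, following the description of the two NAT filters, that an rmq rule rewrites the destination address of packets received on an interface and that an smq rule rewrites the source address of packets sent on an interface; the classes NatFilter and Interface and the function forward are hypothetical names for this sketch.

```python
from ipaddress import ip_address, ip_network


class NatFilter:
    """Ordered list of (network, replacement address) rules."""

    def __init__(self):
        self.rules = []

    def add(self, match, mask, new_addr):
        self.rules.append((ip_network(f"{match}/{mask}"), new_addr))

    def translate(self, addr):
        for net, new_addr in self.rules:
            if ip_address(addr) in net:
                return new_addr
        return addr                      # no rule matched: leave unchanged


class Interface:
    def __init__(self, name):
        self.name = name
        self.rmq = NatFilter()           # destination mapping on receive (assumed)
        self.smq = NatFilter()           # source mapping on send (assumed)


def forward(src, dst, in_if, out_if):
    """Return the (source, destination) pair after passing through the VR."""
    dst = in_if.rmq.translate(dst)
    src = out_if.smq.translate(src)
    return src, dst


# The same mappings as in the configuration listing.
if0, if1 = Interface("if0"), Interface("if1")
if0.rmq.add("10.1.1.42", "255.255.255.255", "10.1.2.1")
if1.smq.add("10.1.1.1", "255.255.255.255", "10.1.2.42")
if1.rmq.add("10.1.2.42", "255.255.255.255", "10.1.1.1")
if0.smq.add("10.1.2.1", "255.255.255.255", "10.1.1.42")

request = forward("10.1.1.1", "10.1.1.42", if0, if1)   # host -> dummy address
reply = forward("10.1.2.1", "10.1.2.42", if1, if0)     # answer on the way back

if __name__ == "__main__":
    print(request)
    print(reply)
```

Under these assumptions the request leaves the VR with source 10.1.2.42 and destination 10.1.2.1, and the reply reaches the host as a packet from 10.1.1.42 to 10.1.1.1, so neither end notices that both addresses belong to the same machine.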



Now we apply a simple queueing system to the interfaces. Figure 5 shows the
setup for bandwidth reservation. The classifier C forwards packets according to
their header data to different queues (Q1 and Q2), with the branch over Q1 being
limited by a token bucket filter. A standard Round Robin is used as scheduler.
The setup limits all non-TCP traffic to a maximum of 2 Mbps.

Figure 6 shows a simple evaluation of this queueing system with an aggressive
UDP flow and a TCP flow. The TCP datagrams are forwarded over Q2, UDP over Q1.
The token bucket filter T is set to a bandwidth of 2 Mbps. The VR interfaces are
limited to 4 Mbps. The UDP source sends in intervals, to visualize the reaction
of TCP and the queueing system to massive UDP bursts.

One can clearly observe how the TCP bandwidth decreases from 4 to 2
Mbps when UDP uses its available bandwidth of 2 Mbps. The short dip of
the TCP flow below the guaranteed bandwidth is caused by TCP congestion
control. The graph was obtained with a setup using one VR with two interfaces
and network address translation (NAT) as shown in Figure 4, so the TCP source
and the TCP sink were hosted on the same machine.

This scenario is comparable with a real setup of three machines, one acting 
as sink, one as intermediate router and one as source. 




Fig. 5. A minimum queueing system setup for bandwidth allocation 




Fig. 6. Bandwidth allocation to a specific TCP flow



4 Differentiated Services Setup with Virtual Routers 

In this section we show how a typical Differentiated Services evaluation scenario
can be set up using Virtual Routers. Of course any mixture of real and virtual
components is possible, so VRs can also be used for debugging while developing
real implementations, without occupying a whole pool of Unix
workstations and routers.

4.1 Setup of a DiffServ Queueing System

Figure 7 shows a possible configuration of a VR's queueing system for Differentiated
Services.

Fig. 7. A Differentiated Services Queueing System with a Best Effort queue, an
Expedited Forwarding queue and a queue for one Assured Forwarding class



The first component BM is a generic DS marker, measuring flows and reclassifying
them according to their Service Level Agreement. C is the classifier,
forwarding the packets to the corresponding queues. The queues for Expedited
Forwarding (EF) and for Best Effort traffic (BE) are standard drop tail queues.
Assured Forwarding (AF) traffic is sent to a TRIO queue (three-state RED with in
and out [FJ93]), which drops packets according to their ToS field values. The
scheduler S is a Priority Weighted Fair Queueing (PWFQ) scheduler [BESS00]. This
scheduler has been specially designed at our institute for the implementation of
Differentiated Services. It allows some queues to be favoured as in priority queueing
and allocates a share of the available resources to the others as WFQ does.
So EF packets get absolute priority while the other queues get a share of the
bandwidth according to their weight. To prevent EF traffic from blocking all other
flows, EF is limited by a token bucket filter.
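
The combination of strict priority and weighted sharing can be sketched in Python as follows. This is not the PWFQ scheduler of [BESS00]: it is a simplified stand-in that serves the EF queue first and shares the remaining capacity with a self-clocked approximation of weighted fair queueing based on virtual finish tags. The class and method names, the fixed class label "EF" and the unit packet size are assumptions, and the token bucket that limits EF is left out.

```python
from collections import deque


class PriorityWeightedScheduler:
    """Strict priority for EF, weighted sharing (by finish tags) for the rest."""

    def __init__(self, weights):
        self.weights = dict(weights)                  # e.g. {"AF": 3.0, "BE": 1.0}
        self.priority = deque()                       # EF packets
        self.queues = {name: deque() for name in self.weights}
        self.last_finish = {name: 0.0 for name in self.weights}
        self.vtime = 0.0                              # virtual time of the system

    def enqueue(self, cls, packet, size=1.0):
        if cls == "EF":
            self.priority.append(packet)
            return
        # A packet starts after the previous packet of its class, or "now".
        start = max(self.vtime, self.last_finish[cls])
        finish = start + size / self.weights[cls]     # larger weight, earlier finish
        self.last_finish[cls] = finish
        self.queues[cls].append((finish, packet))

    def dequeue(self):
        if self.priority:                             # EF always goes first
            return self.priority.popleft()
        heads = [(q[0][0], name) for name, q in self.queues.items() if q]
        if not heads:
            return None
        finish, name = min(heads)                     # smallest finish tag wins
        self.vtime = finish
        return self.queues[name].popleft()[1]


if __name__ == "__main__":
    s = PriorityWeightedScheduler({"AF": 3.0, "BE": 1.0})
    for i in range(8):
        s.enqueue("AF", ("AF", i))
        s.enqueue("BE", ("BE", i))
    s.enqueue("EF", ("EF", 0))
    print([s.dequeue() for _ in range(9)])
```

With weights 3 and 1 and equally sized packets, the AF queue receives about three service opportunities for every one given to BE while both are backlogged, and an EF packet overtakes both regardless of its arrival order.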

4.2 Setup of the Evaluation Topology 

In this section we describe a setup for a Differentiated Services evaluation and
its realization on a VR architecture. As a basis for the network layout we use the
topology of the SWITCH network in Switzerland. Figure 8 shows the existing
network and its representation with Virtual Routers. Each access network is
realised by one softlink device. Of course a more complex setup of the access
network could also be implemented. The nodes of the SWITCH network are
emulated on three hosts, each running several instances of Virtual Routers.

Fig. 8. The SWITCH network (from http://www.switch.ch/lan/national.html)
and its representation with Virtual Routers

5 Summary and Outlook

The idea of Virtual Routers and softlink devices presented here has proven useful
for the quick development and evaluation of new traffic conditioning equipment
and for emulating bigger testbeds for the debugging of 'real' programs.

The results available so far correspond to measurements in real scenarios. Of
course Virtual Routers cannot emulate the behaviour of a real network
exactly, and never will.

Future extensions will focus on two objectives: manageability and portability.
Work on a graphical user interface for the setup and control of multiple Virtual
Routers distributed over multiple hosts has just started, and even direct interfaces
to topology generators like Tiers [CDZ96] are planned. The goal is to allow the
setup of evaluation scenarios with dozens of routers distributed over a pool of
workstations in the time usually needed to configure one device.

For better portability between Virtual Routers and real machines a
port of Berkeley Sockets to the VR platform is planned, allowing programs to run
in a VR environment using the same techniques to access the network as on
real Unix hosts. This also includes the implementation of an Active Networking
environment and the corresponding protocols like the Active Network Encapsulation
Protocol (ANEP) [ABG+97], providing a platform for the evaluation of Active
Networking and traffic control cooperation.



Abstract. This paper introduces the Differentiated Services Resource
Updating Protocol (DSRUP), a resource management protocol to be im- 
plemented in a DiffServ environment. The protocol makes use of periodic 
updating information sent from the egress edge of the DiffServ domain. 
Changes are made to this information as it passes through interior nodes 
to reflect the current resource conditions in the domain. Ingress nodes 
will make use of this information to make informed resource management 
decisions of traffic entering the domain. Initial simulation tests show that 
the protocol is able to provide better resource utilization when compared 
with conventional DiffServ. 

Keywords: Differentiated Services, QoS, Resource Management, Ad- 
mission Control 



1 Introduction 

Work over the past few years on providing Quality of Service (QoS) over net- 
works has led to the development of many technologies. One of the goals is 
to provide a differentiation of services over the networks with different perfor- 
mance and quality guarantees associated with each service. The developments, 
particularly that of Integrated Services (IntServ) [1] and the Resource Reserva- 
tion Setup Protocol (RSVP) [2] by the Internet Engineering Task Force (IETF), 
have allowed networks the ability to provide end-to-end guarantees for different 
kinds of services. 

The IntServ/RSVP architecture utilizes per-flow signaling to provide service
performance (e.g. bandwidth, delay) guarantees and differentiation. This
approach, although effective in providing QoS in a relatively small network,
poses scalability problems in larger networks like the Internet. The Differentiated
Services (DiffServ) framework [3] was developed to both replace and
complement IntServ in providing QoS over large networks.

DiffServ essentially classifies packets into one of a small number of aggregated
flows for processing. This is done by setting the Type of Service (TOS) octet of
the IPv4 header or the Traffic Class octet of the IPv6 header, both renamed as
the DS field. Therefore, instead of numerous individual microflows in a network,
only a small number of aggregated flows are seen. DiffServ provides for scalable
QoS by keeping the interior architecture (routers, schedulers) as simple as possible
and pushing complexity to the edge of the network. Hence, the boundary
routers are responsible for conditioning, shaping, admission control, etc.

Although the hope was for DiffServ to be as simple a scheme as possible (at 
least in its core), we note that the basic mechanisms defined by IETF may not be 
sufficient to provide for an effective service differentiation in the network. In fact, 
when DiffServ was first proposed in [4], it was noted that “Independent labeling 
by individuals is simple to implement but unlikely to be sufficient since it’s 
unreasonable to expect all individuals to know all their organization’s priorities 
and current network use and always mark their traffic accordingly.” Research 
results in [5, 6, 7, 8] have also shown that in some cases, existing mechanisms may 
result in unfairness and inefficient resource utilization in DiffServ. 

There has been much activity in the DiffServ community in trying to offer 
a solution to better manage resources [8, 9, 10, 11, 12]. In this paper, we propose 
a protocol that provides assessment of current resource availability in a DiffServ 
domain, through a periodic updating from the network core. The protocol is 
called Differentiated Services Resource Updating Protocol (DSRUP). 

The objective of DSRUP is to provide effective and efficient resource man- 
agement in a DiffServ environment with minimal increase in the complexity of 
the architecture, especially within the network core. 

Section 2 contains the definition of some terms which we will be using in 
this paper. Section 3 provides an overview of the DSRUP protocol by giving 
an example of how it works. In Section 4, we briefly describe the basic 
architectural model of DSRUP and how it differs from conventional DiffServ¹. 
Section 5 will discuss results of simulations that we have conducted to evaluate 
the effectiveness of the protocol. Finally, we conclude the paper in Section 6. 

2 Definition of Terms 

Over time, common networking terms have come to mean 
different things to different authors. We therefore define the terms used 
throughout this paper so that there is a common understanding of 
what they mean here. 

Fig. 1 shows the external connections of a typical DiffServ network. 

DS domain. A DiffServ-capable domain with a set of nodes operating with a 
common set of service provisioning policies and Per-Hop Behavior (PHB) defi- 
nitions. 

¹ The main goal of the IETF DiffServ Working Group is to develop a general conceptual 
model of the service and not the full service. Therefore we can only define our 
DiffServ model according to our best understanding of what it will be like, based on 
the literature currently available. Conventional here means an architecture that has 
been defined and accepted by the working group. 




Joo Ghee Lim et al. 




Fig. 1. The external connection of a typical Differentiated Services network 
showing the boundary routers 



Boundary router. A DiffServ node within a DS domain that connects the domain 
to a node either in another DS domain or in a non-DiffServ-capable domain. 

Edge router. A node in another DS domain or non-DiffServ-capable domain that 
connects to the boundary router of a DS domain. This is the last node (egress) 
before a packet enters the DS domain in question, or the first node (ingress) 
the packet meets after leaving the DS domain. This can also be a source or 
destination host that is directly connected to the DS domain. 

Interior router. A DiffServ node within a DS domain that is not a DiffServ 
boundary router. 

3 Overview of the Protocol 

The basic idea of DSRUP is to allow the ingress boundary router of a DiffServ 
network to be aware of the availability of resources in the network. The informa- 
tion of resource availability is provided by the downstream interior router. This 
information will enable the boundary router to make informed decisions (e.g. 
admission control, QoS routing) that will ensure efficient use of the resources in 
the network. For much of this paper, DSRUP will be used primarily to provide 
information for making admission control decisions over DiffServ. 

3.1 Upstream Edge Router or Host Action 

When a new flow reaches the egress edge router of the upstream domain or a 
new flow is started by a source host, a reservation request will be passed to the 
boundary router. This may take the form of a modified RSVP PATH message 
(in the case of an IntServ edge) or a special request packet containing the service 
needed (e.g. EF, AF), the resources needed (amount of bandwidth, delay bound, 
etc.) and the destination address of the flow. 

3.2 Boundary Router Action 

Each DSRUP-capable boundary router will maintain a Resource State Table that 
is updated periodically with information provided by the downstream interior 
routers. The table includes the next-hop router leading to each egress boundary 
router of the DS domain and the least amount (i.e. the bottleneck) of resources 
found along this route to the egress router. An entry is kept for each class (or 
PHB) of service. Based on the information found in the table, the boundary 
router reports back to the upstream router or host whether the flow is 
accepted. 
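
As an illustration only, the Resource State Table described above could be held as a mapping keyed by egress router and service class. The following is a minimal sketch in Python; the names `ResourceStateTable`, `TableEntry`, `next_hop` and `bottleneck_mbps` are hypothetical and are not part of the protocol definition.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class TableEntry:
    next_hop: str           # next-hop router leading to the egress router
    bottleneck_mbps: float  # least bandwidth reported along that route


class ResourceStateTable:
    """Hypothetical per-ingress table: one entry per (egress, class) pair."""

    def __init__(self) -> None:
        self._entries: Dict[Tuple[str, str], TableEntry] = {}

    def update(self, egress: str, phb: str, next_hop: str,
               bottleneck_mbps: float) -> None:
        # Overwrite the entry with the latest periodic update.
        self._entries[(egress, phb)] = TableEntry(next_hop, bottleneck_mbps)

    def bottleneck(self, egress: str, phb: str) -> Optional[float]:
        # None means no update has been received yet for this pair.
        entry = self._entries.get((egress, phb))
        return None if entry is None else entry.bottleneck_mbps


table = ResourceStateTable()
table.update("Z", "A", next_hop="X", bottleneck_mbps=10.0)
table.update("Z", "A", next_hop="X", bottleneck_mbps=8.0)  # newer update wins
```

Keeping only the latest value per pair reflects the periodic-refresh idea: the table holds no per-flow state, only one number per egress router and class.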

If the flow is accepted by the boundary router, the upstream edge router or 
host will proceed to mark (or remark) the DS field of the packets belonging to 
the flow with the PHB requested. As a result, the packets entering the DS 
domain will be identified not by their individual flow but by the DSCP. At the 
interior routers within the core of the DS domain, the packets will be processed 
according to the associated PHB. 



3.3 Resource Updating: Job of the Interior Routers 

The bottleneck resource found along the path from the ingress boundary router 
to the egress boundary router of the DS domain must be updated periodically 
to ensure that the Resource State Table is current and accurate. We will now 
explain the process by which this information is updated by way of an example. 

In this paper, we will be using the bandwidth of the links between routers as 
the resource we want to manage. Each service class (or PHB) may be allocated a 
certain percentage of the total bandwidth in the link. This may be agreed upon 
between service providers of adjacent domains or between a service provider 
and a customer, in the form of service level agreements (SLAs). For example, a 
particular class A may be given 20%, class B 20% and class C 60% of the total 
bandwidth. 
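
The per-class split in this example can be worked through numerically. The sketch below assumes a hypothetical 100 Mbps link; the link capacity and the function name `class_allocations` are illustrative only.

```python
def class_allocations(link_mbps: float, shares: dict) -> dict:
    """Split a link's bandwidth among service classes by fractional share."""
    total = sum(shares.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("class shares must sum to 1")
    return {phb: link_mbps * share for phb, share in shares.items()}


# Shares from the example: class A 20%, class B 20%, class C 60%.
# The 100 Mbps link capacity is an assumed value.
alloc = class_allocations(100.0, {"A": 0.20, "B": 0.20, "C": 0.60})
```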

The egress boundary router of the DS domain periodically monitors its down- 
stream link and sends out a packet containing information about the bandwidth 
available for each class of service to its upstream routers. In the example shown 
in Fig. 2, egress router Z sends out the available bandwidth to upstream router Y 
among others. 

Fig. 2. Egress boundary router Z monitors the available bandwidth on its downstream 
link (box represents the bandwidth available for each class on the link). It 
subsequently sends a packet containing this information to its upstream routers 
(i.e. bandwidth available for class A at router Z is 20 Mbps, for class B at 
router Z is 30 Mbps, etc.) 

Every intermediate router, on receiving the information Bi from its downstream 
router, will perform the following actions: 

1. If the downstream router is an egress boundary router, go to 3. 

2. If the downstream router is an intermediate router, 

(a) from the routing table, check if the downstream router is a next-hop router to 
the egress boundary router. 

(b) If no, disregard the update information Bi. 

(c) If yes, go to 3. 

3. Source for the bandwidth currently available at the particular downstream 
link, Ba. 

4. Compare the information Bi with Ba. 

5. If Bi < Ba, the information Bi is passed unchanged. 

6. If Bi > Ba, then Bi = Ba. 

7. Send Bi to all upstream routers. 

Following our example, as shown in Fig. 3, router Y will compare the infor- 
mation supplied by router Z with the bandwidth available in the downstream 
link (i.e. the link between Y and Z) and make the necessary changes to the infor- 
mation it passes upstream, to router X among others. As shown, the bandwidth 
available in the link for class A (in this case 10 Mbps) is less than that found 
in the information from router Z, ZA-20², and so is updated to the new value 
(ZA-10) before sending upstream. 

At router X, the same procedure is repeated, only this time, because the 
downstream router is not an egress boundary router, router X checks against 
its routing table if router Y is the next hop of the path to router Z. If it is, 
it compares the information passed with the downstream link and makes the 
necessary changes, as shown in Fig. 4. 

This procedure is repeated until the information reaches the ingress boundary 
router which will update its Resource State Table with the information supplied. 
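
The per-router update steps and the Z, Y, X example can be sketched as follows. This is a minimal illustration in Python, not the authors' implementation: the function name `process_update` and its boolean arguments are hypothetical, and several bandwidth values are assumed where the text does not give them (see the comments).

```python
from typing import Dict, Optional

Update = Dict[str, float]  # service class -> bottleneck bandwidth (Mbps)


def process_update(received: Update, link_available: Update,
                   downstream_is_egress: bool,
                   downstream_is_next_hop: bool) -> Optional[Update]:
    """Apply the update steps at one router; return what is sent upstream.

    Returns None when the update is disregarded (step 2b).
    """
    # Steps 1-2: accept updates from the egress router itself, or from an
    # intermediate router that is the next hop towards that egress router.
    if not downstream_is_egress and not downstream_is_next_hop:
        return None
    # Steps 3-6: keep the smaller of the received value (Bi) and the
    # bandwidth currently available on the downstream link (Ba).
    return {phb: min(bi, link_available[phb]) for phb, bi in received.items()}


from_z = {"A": 20.0, "B": 30.0, "C": 50.0}    # class C value is assumed
link_y_z = {"A": 10.0, "B": 40.0, "C": 60.0}  # only class A (10) is from the text
link_x_y = {"A": 15.0, "B": 25.0, "C": 40.0}  # class A value is assumed

from_y = process_update(from_z, link_y_z, True, False)   # Z is an egress router
from_x = process_update(from_y, link_x_y, False, True)   # Y is next hop to Z
```

Because each router only takes a minimum over the value it receives and its own link, the value reaching the ingress router is the bottleneck over the whole path, and no interior router needs to store anything between updates.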



² Note: ZA-20 means that the bottleneck bandwidth so far for class A on this path to 
router Z is 20 Mbps. 

Fig. 3. On receiving the information from router Z and knowing that router Z 
is an egress boundary router, router Y checks its downstream link with router Z 
and compares the bandwidth available with that which it receives. The box 
represents the bandwidth available on the link. It makes the necessary changes 
and sends the updated information to its upstream routers 

Fig. 4. Since router Y is a next-hop router to router Z, on receiving information 
about the path to router Z, router X will source for the bandwidth available 
on the link between router X and Y and make the necessary changes to the 
information. In this case, ZB is updated to 25 and ZC to 40 since the bandwidth 
found here is less than that found in the information received 

4 Architectural Model 

Having looked at a broad overview of how DSRUP works, it is clear that DSRUP 
is designed to work on top of the conventional DiffServ architecture. The aim is 
to minimize the additions and modifications needed, so as to strike the right 
balance between adding efficiency to the DiffServ model and keeping DiffServ 
a simple QoS mechanism useful for large backbone networks. 
In this section, we define the additional architecture as well as the modifications 
needed to construct DSRUP. 

4.1 DSRUP Router 

In addition to the functions found in existing DS routers, a DSRUP-enabled 
router includes a resource monitoring function. It must be able to 
monitor the resource (e.g. load, queue length, delay) found on its immediate 
link. To reduce its complexity, the protocol does not require a DSRUP-enabled 
interior router to keep a record of these resource values. The values are only 
sourced when an information update packet is received. 

In addition, the router must be able to compare the sets of 
information it has been provided and update the information update packet as 
required. 

4.2 DSRUP Boundary Router 

A boundary router in the conventional DiffServ architecture contains more functions 
than its interior counterpart. Similarly, in addition to those described in 
Section 4.1, a DSRUP-enabled boundary router also needs to contain more functions. 



Ingress Router. A DSRUP ingress boundary router maintains a Resource State 
Table that reflects the bottleneck resource along all the paths to every egress 
router found in the DS domain. Based on the information found in the table, it 
is able to make informed decisions to achieve efficient resource management of 
the domain. 



Egress Router. A DSRUP egress router will periodically monitor and source 
information about the resource of its output and send this information upstream. 
It may take the form of an information update packet, sent either by itself 
or “piggy-backed” on some existing protocol (e.g. a routing protocol). 

5 Simulations 

To investigate the effectiveness and efficiency of our protocol, we conducted 
several sets of simulations. The simulations again use the case of 
applying DSRUP to admission control decisions. Note that the simulations and 
results in this section are meant to illustrate and evaluate the performance of 
DSRUP and do not represent specific implementations or requirements. 

5.1 Simulation Model 

The simulations have been performed with ns-2 (version 2.1b5) [13] with mod- 
ifications done to include DiffServ and DSRUP support. The nodes have been 
modified to include DSRUP capabilities of receiving, updating and sending an 
information update packet. Similarly, the boundary nodes are modified to enable 
them to send periodic information update packets. DS conditioner and policer modules 
have also been added. Due to space constraints, details of the modifications 
are not shown here. 

In the simulation, we want to evaluate the effectiveness of DSRUP in the 
admission control of Assured Forwarding (AF) traffic. We compare the simula- 
tion results between a conventional DiffServ case and a DSRUP-enabled domain 
implementing an admission control algorithm to manage the AF traffic entering 
the domain. 

Fig. 5 shows the topology that was used during the simulation. The 1 Mbps 
link forms the bottleneck link. Each source generates TCP packets that are 
marked as AF traffic. The AF PHB is characterized by different levels of forwarding 
assurance for the IP packets received. A packet from a higher forwarding 
probability level will be forwarded with a higher probability as long as the traffic 
does not exceed the service profile. A packet belonging to traffic that exceeds 
the service profile will be forwarded with a lower probability. A comprehensive 
description of the AF PHB group can be found in [14]. 




Fig. 5. The topology of the simulation. S1-S20 represent the 20 source hosts 
generating TCP flows that are classified as AF traffic, D1-D20 are the 20 destination 
hosts receiving the traffic. B1-B40 are the DSRUP-enabled boundary 
nodes and I1-I2 are the DSRUP-enabled interior nodes 



The AF PHB implemented here has only two priority classes. At a source 
host, traffic is marked by a profiler implemented using a token bucket. For example, 
if the profile is 50 kbps, packets generated at a rate of more than 50 kbps will be 
marked OUT and those generated within that rate will be marked IN. A policer 
that implements the RIO algorithm [15] is installed in each router before the 
link. The queue length found in every link is 30 packets and the RIO parameters 
we used in this simulation are: 

15/30/0.02 for IN parameters and 5/10/0.05 for OUT parameters³ 
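
To make the marking and dropping behaviour concrete, the following Python sketch shows a token-bucket marker and the basic RED drop curve applied with the two parameter sets above. It is a simplified illustration under stated assumptions, not the simulation code: the bucket depth is an assumed parameter, queue averaging and RED's inter-drop count adjustment are omitted, and all names are hypothetical. In RIO, the IN curve is normally driven by the average queue of IN packets and the OUT curve by the average total queue.

```python
from dataclasses import dataclass


@dataclass
class RedParams:
    min_th: float  # minimum queue threshold (packets)
    max_th: float  # maximum queue threshold (packets)
    max_p: float   # maximum drop probability


def drop_probability(avg_queue: float, p: RedParams) -> float:
    """Basic RED curve: 0 below min_th, linear up to max_p, 1 from max_th."""
    if avg_queue < p.min_th:
        return 0.0
    if avg_queue >= p.max_th:
        return 1.0
    return p.max_p * (avg_queue - p.min_th) / (p.max_th - p.min_th)


IN_PARAMS = RedParams(15, 30, 0.02)  # 15/30/0.02 from the text
OUT_PARAMS = RedParams(5, 10, 0.05)  # 5/10/0.05 from the text


class TokenBucketMarker:
    """Marks packets IN while tokens last, OUT otherwise."""

    def __init__(self, rate_bps: float, depth_bits: float) -> None:
        self.rate_bps = rate_bps
        self.depth_bits = depth_bits  # bucket depth is an assumed parameter
        self.tokens = depth_bits
        self.last_time = 0.0

    def mark(self, now: float, size_bits: int) -> str:
        # Refill tokens for the elapsed time, capped at the bucket depth.
        elapsed = now - self.last_time
        self.tokens = min(self.depth_bits,
                          self.tokens + elapsed * self.rate_bps)
        self.last_time = now
        if self.tokens >= size_bits:
            self.tokens -= size_bits
            return "IN"
        return "OUT"


# A 50 kbps profile with 1000-bit packets; the second packet arrives too soon.
marker = TokenBucketMarker(rate_bps=50_000, depth_bits=1000)
marks = [marker.mark(t, 1000) for t in (0.0, 0.01, 0.05)]
```

With these parameters OUT packets start to be dropped at a much shorter queue than IN packets, which is what gives in-profile traffic its higher forwarding assurance.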

The TCP flows are all infinite FTP sources with starting times randomly 
distributed within 10s. Each source lasts for a randomly distributed period of 
100s. This is repeated over the whole duration of the simulation, generating 
on-off traffic for each source.⁴ All packets are of fixed size, 1000 bits in length. 



DSRUP Module. The egress boundary nodes (B21-B40) periodically send 
out Control Forwarding (CF) packets containing information such as the bottleneck 
bandwidth (for every service class) thus far in the domain, the address of 
the egress node, the address of the previous hop, etc. These packets are small 
in size but are treated with the highest priority. In our simulation, each egress 
boundary node sends out a CF packet every 0.1s. 
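
One possible in-memory shape for such a CF packet is sketched below in Python. The field names, the `send_times` helper and the example values are hypothetical; the text lists the kinds of information carried but does not define a packet format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CFPacket:
    egress_addr: str    # address of the originating egress node
    prev_hop_addr: str  # address of the node that last forwarded the packet
    # Bottleneck bandwidth seen so far, per service class (kbps).
    bottleneck_kbps: Dict[str, float] = field(default_factory=dict)


def send_times(interval_s: float, duration_s: float) -> List[float]:
    """Times at which an egress node would emit CF packets."""
    count = int(round(duration_s / interval_s))
    return [i * interval_s for i in range(count)]


# Example values are assumed, including the 1000 kbps figure.
pkt = CFPacket("B21", "B21", {"AF": 1000.0})
times = send_times(0.1, 1.0)  # one packet every 0.1 s over a 1 s window
```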

Upon receiving a CF packet, every node will source for the bandwidth available 
on its downstream link and make the necessary changes to the packet before 
sending it upstream. At the ingress boundary nodes (B1-B20), each node keeps 
a record of the bottleneck bandwidth along every path to the destination (which 
is at the bottleneck link, I1-I2 in this case). 

Admission Control Module. For the admission control portion of the simulation, 
we implement a simple admission control algorithm. Before the host sends 
out a new burst of FTP traffic (we define that to be a new flow), it requests 
permission from the boundary node. The boundary node, on checking its 
record of the bottleneck link, will decide whether to accept or reject the 
request. 

If the new flow is accepted, the host will proceed to mark the packets accord- 
ingly to conform to its predetermined profile. If the flow is rejected, the host will 
wait for a random period within 10s before making a new request. 
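
The request-and-retry behaviour can be sketched as follows. This is a hedged illustration in Python: it assumes the acceptance rule stated in the results discussion (a flow is admitted only if the available bottleneck bandwidth is more than its profiled rate), and the callback `bottleneck_at`, the `max_tries` limit and the function names are hypothetical.

```python
import random


def admit(bottleneck_kbps: float, profile_kbps: float) -> bool:
    """Accept only if the available bottleneck bandwidth exceeds the profile."""
    return bottleneck_kbps > profile_kbps


def request_until_admitted(profile_kbps, bottleneck_at, rng,
                           max_wait_s=10.0, max_tries=100):
    """Return the time at which the flow is admitted, or None if it gives up.

    bottleneck_at(t) is a hypothetical callback returning the boundary
    node's recorded bottleneck bandwidth (kbps) at time t.
    """
    t = 0.0
    for _ in range(max_tries):
        if admit(bottleneck_at(t), profile_kbps):
            return t
        t += rng.uniform(0.0, max_wait_s)  # random wait within 10 s
    return None


# Assumed scenario: bandwidth frees up at t = 5 s.
admitted_at = request_until_admitted(
    50.0, lambda t: 100.0 if t >= 5.0 else 20.0, random.Random(1))
```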

³ The notation here means Minimum Queue Threshold/Maximum Queue Threshold/Maximum 
Drop Probability. 

⁴ Though hardly a true reflection of any present Internet traffic, this will nevertheless 
give us a rough idea of how the protocol performs. 

5.2 Results and Discussion 

This section shows the results of our simulations to evaluate the relative performance 
of DSRUP. We ran the simulation model described above for run times of 500s 
and 1000s. For each model (conventional and DSRUP-enabled), the simulation 
was done with each source having a profile of 40 kbps, 50 kbps and 60 kbps of traffic, 
corresponding to 80%, 100% and 120% loading respectively. Note that due to 
their adaptive nature, TCP flows will continue to increase their transmission 
rates beyond the profiles set, until a packet is dropped. In all, 20 simulation runs 
were done for each run time and the average calculated. 
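
The loading figures follow from the topology: 20 sources share the 1 Mbps bottleneck link. A short check in Python, assuming 1 Mbps = 1000 kbps (the function name is illustrative):

```python
def offered_load(num_sources: int, profile_kbps: float,
                 bottleneck_kbps: float) -> float:
    """Aggregate profiled rate as a fraction of the bottleneck capacity."""
    return num_sources * profile_kbps / bottleneck_kbps


# 20 sources with 40, 50 and 60 kbps profiles over a 1000 kbps bottleneck.
loads = [offered_load(20, p, 1000.0) for p in (40.0, 50.0, 60.0)]
```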

Fig. 6 shows the results of the overall rate achieved at the end of the simula- 
tion. It can be seen that DSRUP-enabled admission control allows for a better 
utilization of the bandwidth in the network. The boundary nodes only allow new 
flows to enter the network if the bottleneck bandwidth available is more than 
the profiled rate of a new flow. This prevents the flow from being admitted only 
to have many of its packets dropped along the path. Note also that loading the 
network only marginally degrades the performance in terms of achieved rate. 

Fig. 7 shows the amount of IN packets transmitted as a percentage of the total 
number of packets (IN and OUT). DSRUP-enabled admission control allows for 
a higher percentage of IN packets to be transmitted over the bottleneck link. 

Fig. 6. Graph showing the achieved AF rate for conventional DiffServ and 
DSRUP DiffServ over 500s and 1000s simulation runs for 80%, 100% and 120% loading 



Fig. 7. Graph showing the percentage 
of total packets transmitted that were 
IN packets for conventional DiffServ 
and DSRUP DiffServ over 500s and 
1000s simulation run for 80%, 100% 
and 120% loading 



Fig. 8 shows the percentage of total packets that were dropped at the bot- 
tleneck link during the simulation run. We note that for both conventional and 
DSRUP DiffServ networks, only the OUT packets are dropped. This is in agree- 
ment with the performance of a typical DiffServ network, where packets with 
lower forwarding probability (the OUT packets in this case) are dropped first 
in the congested link. It can be seen that for all three loading conditions, the 
amount of dropped packets at the bottleneck link is reduced in the DSRUP case. 
This translates to better utilization of resources in the network. 

Similarly, Fig. 9 shows the percentage of total packets that were retransmitted 
during the simulation run. The packets were retransmitted because of a 
retransmission timeout occurring in the TCP layer, due either to dropped pack- 
ets or excessive delays in the queue. The amount of retransmitted packets was 
also less in the DSRUP case compared to conventional DiffServ. 




[Figure residue: bar chart legend: Conventional DiffServ - 500s, DSRUP DiffServ - 500s, Conventional DiffServ - 1000s, DSRUP DiffServ - 1000s; x-axis: Loading / % (80, 100, 120)] 



Fig. 8. Graph showing the percent- 
age of total packets that were dropped 
in the simulation run for conventional 
DiffServ and DSRUP DiffServ over 
500s and 1000s simulation run for 80%, 
100% and 120% loading 



Fig. 9. Graph showing the percent- 
age of total packets that were retrans- 
mitted for conventional DiffServ and 
DSRUP DiffServ over 500s and 1000s 
simulation run for 80%, 100% and 
120% loading 



The above simulation results are intended to show that DSRUP can be used 
to provide effective resource management. Overall, the utilization of the resource 
within a DiffServ network can be improved. This is just a brief study of the effec- 
tiveness and efficiency of DSRUP. We are currently conducting more extensive 
simulations to further investigate the strengths and weaknesses of DSRUP and 
the results will be the subject of a future paper. 

6 Conclusion 

In this paper, we have introduced a resource updating protocol to be imple- 
mented in the DiffServ domain to allow for effective resource management within 
the domain. A survey of the literature and of activity on the DiffServ mailing list has 
revealed that in order for the DiffServ model to be practical, there is a need to 
provide additional mechanisms, especially for the management of the resources 
in the network. We must ensure, however, that in providing these mechanisms 
we do not add unnecessary complexity to the existing DiffServ architecture, 
so as not to destroy the fundamental philosophy of DiffServ. Hence any scheme 
should minimize signaling and data recording. 

In DSRUP, we hope to provide such a scheme. The aim is to make the signaling 
within the network core as simple and minimal as possible and add complexity 
only at the edge. The initial simulations have been positive in showing that 
DSRUP does indeed improve the resource utilization within the network. The 
tradeoff is of course the additional architecture and signaling involved. Investi- 
gations into the strengths and weaknesses of our protocol and its performance 
under different network conditions as well as PHB types are the subject of our 
ongoing research. 

Abstract. Price and quality differentiation are valuable tools that can 
provide higher revenues and increase utilization efficiency of a network, 
and thus in general increase social welfare. Such measures, most notice- 
able in airline pricing, are spreading to many services and products, es- 
pecially high-tech ones. However, it is questionable whether they should 
or ever will be used widely in Internet transport. The main application 
of QoS techniques, if any, is likely to be in access links, either because 
resource constraints create an especially strong case for them (as may be 
true in some wireless connections), or for price discrimination purposes. 
However, in the photonic backbones of the Internet it is best to provide 
uniformly high quality through low utilization. 

The main problem with most QoS techniques is that they require substantial 
involvement of the end users. When one considers the costs of 
the entire system, the seeming inefficiency of lightly utilized backbones 
pales next to the savings in engineering and operations of the rest of the 
information processing system (which includes far more than just the 
network). This argument is supported by historical evidence. The trend 
in a variety of communication services has been to pay more attention to 
user preferences and less to network efficiency as the service evolved. 

An additional factor that militates against QoS is that user utility is de- 
rived primarily from low transaction latency. That is what leads to low 
utilization, and makes most QoS techniques irrelevant. 

Abstract. Commercial provisioning of QoS enhanced IP services is not 
yet reality on a per-user basis. This panel plans to identify open issues 
that need to be resolved for commercial QoS offerings. Open issues include: 
charging in multi-provider scenarios; Authentication, Authorisation 
and Accounting; tariff dimensioning; and scalable metering of resource 
usage. How to address the open issues depends significantly on 
the selected approach to charging for QoS. The following, to a large 
extent opposing, positions can be identified for suitable approaches to 
charging for Internet services. 



1 Arguments for QoS-Insensitive Charging of Internet 
Transport 

According to the viewpoint advocated by Odlyzko [1], QoS will not play a significant 
role in charging for Internet transport. One reason is the trend towards low network 
utilisation [2]. In such a situation, the possible improvement of QoS mechanisms is 
low, while their complexity adds significantly to network costs. Therefore, it is likely 
that the Internet transport will be dominated by best-effort services. As a consequence, 
charging schemes can be kept very simple. Odlyzko sees cases for QoS mechanisms, 
together with possible success of QoS-sensitive charging, only for the access 
part of the network. For access networks, resource constraints may be unavoidable (in 
particular for wireless access), and the market is likely to allow for price discrimina- 
tion. 



2 Differentiation of Network-to-Network Charging and End 
User Charging - The Dual Charge Approach 

According to the Dual-Charge viewpoint, as advocated by Wolisz [3], completely 
different charging approaches have to be selected for the cases of (a) charging between 
different providers, and (b) charging of end users [4]. Charging between different 
network service providers can be kept simple and based on network metrics. Volume-based 
charging, where the price of different QoS classes would vary according to 
the QoS level, would satisfy the requirement for network-to-network charging. On the 
other hand, end user charging should not be based on network level metrics, but on 
application metrics of which users are aware. End user charging schemes must 
reflect application service semantics. This avoids charging end users based on 
usage data such as byte volume that is highly dependent on specifics of lower layers. 



3 Subjective Assessment of Audio and Video Quality 

The work of Sasse [5] focuses on the factors that influence the subjective assessment 
of audio and video quality. As previous research has shown there exists no direct 
relationship between subjective assessment of audio and video quality, and objective 
QoS parameters. Therefore, the factors that affect subjective quality assessment have 
to be considered for end user charging. Recent work on the subjective assessment of 
audio and video quality [6] has shown that factors such as users' task and level of 
experience and whether users are required to pay for that quality significantly influ- 
ence subjective ratings of the same objective quality. The work by Sasse establishes a 
relationship between the media quality experienced by users of networked multimedia 
applications and user cost, which is defined as stress. Advantages can be expected if 
charging schemes for audio and video take this relationship into account. 



4 Lightweight Policing and Charging 

According to the viewpoint of Briscoe et al. [7], QoS differentiation may be achieved 
by combining a lightweight, packet-granularity based charging of end users with a 
simple network that provides classification and scheduling at routers, but no policing 
[8]. In this approach, the charging functionality emulates policing. It has the advan- 
tages that the charging functionality can be implemented mostly within the end system, 
and can be separated from the data path. Functions that can be implemented within 
customer systems include metering, accounting and billing and also per-packet or per- 
flow policing and admission control. The resulting simplicity of the provider network 
allows for lower costs. As shown in [8], the approach is suitable for inter-provider 
charging, multicast charging and open bundling of network charges with those for 
higher class services. 



5 Relation of Tariffs and Network Dimensioning 

In the viewpoint of Roberts [9], the most important element for providing QoS is 
admission control. A number of difficulties arise when a provider offers different 
quality levels on a per-packet level, e.g. according to the DiffServ approach. The motivation 
for this viewpoint is a multiservice network provider who wishes to offer users 
quality of service guarantees concerning transparency, accessibility and throughput [10]. 
These guarantees can be respected jointly by the definition of a service 
model and by the provision of capacity in relation to demand. The feasibility of ra- 
tional network provisioning depends significantly on the charging scheme employed. 
Under these conditions, a uniform charging scheme, applying the same per bit rate for 
stream and elastic traffic, is to be preferred for present purposes to a differential 
charging scheme where users can choose to pay more for better quality. 



6 Network Flow Control by Congestion-Based Pricing 

According to the viewpoint advocated by Courcoubetis [11, 12], feedback in flow 
control is crucial for the scalable and robust evolution of the Internet, and congestion 
prices are the right type of feedback signal. Such mechanisms require intelligence at 
the edges of the network and users need not explicitly know prices. Building at the 
core of the network such signalling mechanisms reduces the need for inflexible call- 
admission procedures, and provides simple means for supporting differentiated quality 
of service and for reducing the risk of congestion due to waste of network resources. 
As content will be bundled with transport, customers will not directly pay for trans- 
port. But unless the right incentives are provided, network resources will be wasted 
and the valuable applications will not perform adequately. This motivates the design 
of charging mechanisms that use prices as an internal control mechanism and collect 
charges from the parties that obtain value from the use of the network. 



7 Audio and Video Applications with Adaptive Reservation and 
Congestion Pricing 

Schulzrinne [13] and Wang [14] consider the viewpoint of a network with QoS support 
and congestion-based pricing. They show that for audio and video applications 
with adaptive reservation, where the demand behaviour of adaptive users is 
based on a physically reasonable user utility function, congestion-based pricing can 
result in advantages over fixed pricing for all users even if only a fraction of the appli- 
cations are adaptive. The congestion-sensitive pricing system takes advantage of ap- 
plication adaptivity to achieve significant gains in network availability, revenue, and 
user-perceived benefit relative to the fixed-price policy. In the considered scenario, 
users with different demand elasticity are seen to share bandwidth fairly, with each 
user having a bandwidth share proportional to its relative willingness to pay for band- 
width. The results also show that even a small proportion of adaptive users may result 
in a significant performance benefit and better service for the entire user population - 
both adaptive and non-adaptive users. 

8 Charging and Accounting Infrastructure 

Commercial provisioning of QoS enhanced IP services in a multi-provider scenario 
implies network nodes must support some or all of the following functionality [15]: 
authentication and authorization, metering, collecting and accounting of network re- 
source usage, tariff information distribution, billing, and settlement. Until now, key 
components for setting up a generic, scalable charging and accounting infrastructure 
are missing. Among these components are configurable, high-performance meters, and 
protocol mechanisms for their coordination for providing an interdomain accounting 
service. The design for this infrastructure for interdomain charging and accounting 
depends on which of the presented viewpoints for charging becomes a viable business 
case. 